MKT 326 · Assignment 2
Florence Books conducted a 50,000-customer trial of its online visual e-book club and wants to know which customers are most likely to subscribe. Using purchase history from offline stores — including total spend, recency, book genre preferences, and demographics — I built two predictive models: a linear regression to understand offline spending behavior, and a logistic regression to predict subscription probability. I then profiled the top and bottom predicted-subscriber deciles to design a targeted email campaign.
I used sqldf in R to run SQL queries directly against the loaded dataframe, replicating the workflow from the course demos. The three core descriptive statistics below establish baselines for the customer population before modeling.
| Metric | Mean | Std Dev | Interpretation |
|---|---|---|---|
| Total Offline $ Spent | $208.32 | $101.36 | Wide spending range; high variance signals segmentation opportunity |
| Total # Books Purchased | 3.89 | 3.48 | Most customers are light buyers; power buyers skew the mean upward |
| Months Since Last Purchase | 12.36 | 8.15 | Average customer hasn't purchased in ~1 year; recency is a key risk factor |
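
The summary statistics above can be reproduced with plain SQL. The report ran its queries through sqldf in R; the sketch below uses Python's built-in `sqlite3` as a stand-in, so it illustrates the idea rather than the exact workflow. The table name `customers`, the column names, and the four toy rows are all hypothetical and are not the Florence Books data.

```python
import math
import sqlite3

# Hypothetical toy rows: (total_spend, n_books, months_since_last).
# These are NOT the Florence Books data; they only exercise the query.
rows = [(120.0, 2, 4), (310.5, 7, 12), (95.0, 1, 30), (250.0, 5, 2)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (total_spend REAL, n_books INTEGER, months_since_last INTEGER)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)


def mean_and_sd(column):
    """Return (mean, sample standard deviation) for one numeric column.

    `column` is interpolated into the SQL text, so pass trusted names only.
    """
    # SQLite has no built-in STDDEV, so the variance is assembled from sums.
    n, total, total_sq = conn.execute(
        f"SELECT COUNT({column}), SUM({column}), SUM({column} * {column}) FROM customers"
    ).fetchone()
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)


for col in ("total_spend", "n_books", "months_since_last"):
    mean, sd = mean_and_sd(col)
    print(f"{col}: mean={mean:.2f}, sd={sd:.2f}")
```

Running the same aggregate against the full 50,000-row table would be expected to give the means and standard deviations reported in the table, assuming the report used the sample (n − 1) standard deviation.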

Subscription rate by gender:

| Gender | Total Customers | Subscribers | Subscription Rate |
|---|---|---|---|
| Female | 33,302 | 2,389 | 7.2% |
| Male / Non-Binary | 16,698 | 2,133 | 12.8% |
Male/NB customers subscribe at nearly 1.8× the rate of female customers (12.8% vs. 7.2%). Female customers make up about two-thirds of the customer base (66.6%) but show lower subscription propensity. The logistic regression confirms this counterintuitive gap: being female remains a strong negative predictor even after controlling for recency, spend, and genre (OR = 0.47). Gender is not the whole story, though. Part of the gap reflects what male/NB customers buy (more art and geography books), and those two genres are the strongest positive predictors of subscription.
Male/NB customers subscribe at 12.8% vs. 7.2% for female customers — a meaningful gap that the logistic model will help explain through book genre preferences and recency differences between groups.
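
The rates and the 1.8× ratio follow directly from the counts in the gender table. A minimal Python check, using only those four counts (the variable names are illustrative):

```python
# Counts copied from the gender table above.
counts = {
    "Female": {"customers": 33_302, "subscribers": 2_389},
    "Male / Non-Binary": {"customers": 16_698, "subscribers": 2_133},
}

# Subscription rate per group = subscribers / customers.
rates = {g: c["subscribers"] / c["customers"] for g, c in counts.items()}

total_customers = sum(c["customers"] for c in counts.values())
total_subscribers = sum(c["subscribers"] for c in counts.values())

female_share = counts["Female"]["customers"] / total_customers
overall_rate = total_subscribers / total_customers
rate_ratio = rates["Male / Non-Binary"] / rates["Female"]

for group, rate in rates.items():
    print(f"{group}: {rate:.1%}")
print(f"Female share of base: {female_share:.1%}")
print(f"Overall subscription rate: {overall_rate:.1%}")
print(f"M/NB rate relative to female rate: {rate_ratio:.2f}x")
```

The overall rate of about 9% is a useful baseline for judging the decile results later in the report.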
The linear regression model predicts a customer's total offline dollar spend from their gender, tenure (months since first purchase), and book category purchase counts. This establishes which customer attributes are associated with higher-value offline buyers — the foundation for understanding who Florence Books should prioritize for the online club.
`total_ ~ IsFemale + first + child + youth + cook + do_it + refernce + art + geog`
Each additional book adds approximately $14–16 to a customer's total offline spend, whatever its category. The effect is remarkably consistent across genres, suggesting that the key driver of offline spend is purchase volume rather than any specific genre preference: customers who buy more books in any category spend more. IsFemale and months since first purchase are not statistically significant (p = 0.433 and p = 0.382).
| Variable | Coefficient | p-value | Significant? | Interpretation |
|---|---|---|---|---|
| Intercept | $149.63 | <0.001 | *** | Baseline spend for a male/NB customer with zero tenure and no purchases in the listed categories |
| IsFemale | +$0.68 | 0.433 | No | Gender does not significantly predict offline spend |
| first (tenure) | −$0.03 | 0.382 | No | Months since first purchase is not a significant spending predictor |
| Children's Books | +$15.26 | <0.001 | *** | Each children's book adds ~$15 to total spend |
| Young Adult | +$15.42 | <0.001 | *** | Each YA book adds ~$15 to total spend |
| Cookbooks | +$15.66 | <0.001 | *** | Each cookbook adds ~$16 to total spend |
| DIY Books | +$15.01 | <0.001 | *** | Each DIY book adds ~$15 to total spend |
| Reference | +$14.51 | <0.001 | *** | Each reference book adds ~$14.50 to total spend |
| Art Books | +$14.46 | <0.001 | *** | Each art book adds ~$14.50 to total spend |
| Geography/Travel | +$15.18 | <0.001 | *** | Each geography book adds ~$15 to total spend |
R² = 0.2656 — the model explains about 27% of variance in offline spend, which is reasonable given we're predicting spending purely from transaction counts without pricing or channel data.
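
To make the coefficients concrete, the fitted equation can be applied to a single customer. The sketch below is in Python rather than the R used for the analysis. It copies the coefficients from the table and assumes the formula's variable names map to the table rows in the order listed (`child` to Children's Books, `youth` to Young Adult, and so on). The example customer is hypothetical.

```python
# Coefficients copied from the linear regression table (US dollars).
# ASSUMPTION: formula variable names map to table rows in the order listed.
COEF = {
    "IsFemale": 0.68,
    "first": -0.03,
    "child": 15.26,
    "youth": 15.42,
    "cook": 15.66,
    "do_it": 15.01,
    "refernce": 14.51,  # spelling follows the variable name in the model formula
    "art": 14.46,
    "geog": 15.18,
}
INTERCEPT = 149.63


def predict_offline_spend(customer):
    """Fitted spend: intercept plus coefficient * value for each predictor given.

    Predictors missing from `customer` are treated as zero.
    """
    return INTERCEPT + sum(COEF[name] * value for name, value in customer.items())


# Hypothetical customer: M/NB, first purchase 24 months ago,
# two art books and one cookbook.
example = {"IsFemale": 0, "first": 24, "art": 2, "cook": 1}
print(f"Predicted offline spend: ${predict_offline_spend(example):.2f}")
```

Because R² is about 0.27, a prediction like this is a rough central estimate; individual customers can differ from it substantially.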
The logistic regression models the log-odds of subscribing to the online book club as a function of recency (months since last offline purchase), monetary value (total offline spend), gender, and book genre purchase counts. Unlike linear regression, logistic regression outputs a probability between 0 and 1, making it ideal for binary classification like subscription prediction.
`subscribe ~ last + total_ + IsFemale + child + youth + cook + do_it + refernce + art + geog`
An odds ratio above 1.0 means a one-unit increase in the variable raises the odds of subscribing; below 1.0, it lowers them. Per unit, art books (OR = 3.18) and geography/travel books (OR = 1.78) are the strongest positive predictors. DIY books (OR = 0.58) and being female (OR = 0.47) are the strongest negative predictors. All variables are highly significant (p < 0.001).
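
Written out, the model and the odds-ratio conversion take the following form, where $p$ is the probability of subscribing and $x_g$ is the purchase count for genre $g$. The intercept $\beta_0$ is not reported in the results, so it is left symbolic here.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_{\text{last}}\,\text{last} + \beta_{\text{total}}\,\text{total\_} + \beta_{\text{IsFemale}}\,\text{IsFemale} + \sum_{g} \beta_g\, x_g,
\qquad \text{OR}_j = e^{\beta_j}
$$

Here $g$ ranges over the seven genre counts (child, youth, cook, do_it, refernce, art, geog). For example, the art coefficient of $1.156$ gives $e^{1.156} \approx 3.18$.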
| Variable | Coefficient | Odds Ratio | p-value | Interpretation |
|---|---|---|---|---|
| last (recency) | −0.095 | 0.910 | <0.001 | Each additional month since last purchase reduces subscription odds by ~9% — recency matters |
| total_ (spend) | +0.001 | 1.001 | <0.001 | Each additional dollar of offline spend raises subscription odds by ~0.1% |
| IsFemale | −0.761 | 0.467 | <0.001 | Female customers have 53% lower odds of subscribing than M/NB customers, controlling for other factors |
| Children's Books | −0.186 | 0.830 | <0.001 | Each children's book reduces subscription odds by 17% |
| Young Adult | −0.113 | 0.893 | <0.001 | Each YA book reduces odds by 11% |
| Cookbooks | −0.270 | 0.763 | <0.001 | Each cookbook reduces odds by 24% |
| DIY Books | −0.539 | 0.583 | <0.001 | Each DIY book reduces odds by 42% — strongest negative genre signal |
| Reference | +0.235 | 1.265 | <0.001 | Each reference book increases odds by 26% |
| Art Books | +1.156 | 3.176 | <0.001 | Each art book more than triples subscription odds — strongest positive signal |
| Geography/Travel | +0.574 | 1.776 | <0.001 | Each geography book increases odds by 78% |
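
Each odds ratio in the table is the exponential of its coefficient, and the percentage changes in the interpretation column are (OR − 1) × 100. A short Python sketch of that conversion, using the coefficients as printed (the dictionary keys follow the model formula, and the mapping to table rows is assumed from their order):

```python
import math

# Logit coefficients copied from the table above.
LOGIT_COEF = {
    "last": -0.095,
    "total_": 0.001,
    "IsFemale": -0.761,
    "child": -0.186,
    "youth": -0.113,
    "cook": -0.270,
    "do_it": -0.539,
    "refernce": 0.235,
    "art": 1.156,
    "geog": 0.574,
}


def odds_ratio(coef, units=1):
    """Multiplicative change in the odds for a `units` increase in the predictor."""
    return math.exp(coef * units)


for name, coef in LOGIT_COEF.items():
    ratio = odds_ratio(coef)
    print(f"{name:9s} OR={ratio:.3f}  change in odds={100 * (ratio - 1):+.1f}%")

# Effects compound multiplicatively: two art books scale the odds
# by exp(2 * 1.156), not by 2 * 3.18.
two_art_books = odds_ratio(LOGIT_COEF["art"], units=2)
print(f"Two art books multiply the odds by about {two_art_books:.1f}")
```

Small differences in the third decimal place against the table are expected, because the printed coefficients are themselves rounded. Note that these are changes in odds, not in probability. Converting to a probability also requires the intercept, which is not reported here.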
The model tells a clear story: the ideal e-book club subscriber is an M/NB customer who purchased recently and buys art and geography books offline. Art books carry a remarkably strong signal (OR = 3.18), possibly because visual and artistic content translates especially well to the visual e-book format. DIY and cookbook buyers, by contrast, likely prefer physical books for practical reference use, so a digital subscription has less utility for them. Recency is a critical lever: each additional month of inactivity lowers subscription odds by about 9%.
Using the logistic model's predicted probabilities, I ranked all 50,000 customers and split them into ten deciles of 5,000 customers each. I then profiled the top decile (highest predicted subscription probability) and the bottom decile (lowest) to understand whom to target and whom to deprioritize in a promotional campaign.
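
The decile assignment itself is a rank-and-bin step. The following Python sketch shows one way to do it, with decile 10 as the highest-probability group. The function name and the 20 example probabilities are hypothetical, and how the original analysis broke ties is not known.

```python
def assign_deciles(probabilities):
    """Return a decile per customer: 1 = lowest, 10 = highest predicted probability."""
    n = len(probabilities)
    # Customer indices ordered from lowest to highest predicted probability.
    order = sorted(range(n), key=lambda i: probabilities[i])
    deciles = [0] * n
    for rank, idx in enumerate(order):
        # Integer arithmetic splits the ranking into 10 near-equal bins.
        deciles[idx] = rank * 10 // n + 1
    return deciles


# Hypothetical predicted probabilities for 20 customers (2 per decile).
probs = [0.31, 0.02, 0.45, 0.11, 0.07, 0.62, 0.01, 0.09, 0.27, 0.38,
         0.05, 0.14, 0.03, 0.52, 0.19, 0.08, 0.23, 0.04, 0.71, 0.16]
deciles = assign_deciles(probs)

top = [p for p, d in zip(probs, deciles) if d == 10]
bottom = [p for p, d in zip(probs, deciles) if d == 1]
print("Top decile probabilities:", top)
print("Bottom decile probabilities:", bottom)
```

With 50,000 customers the same rule yields exactly 5,000 per decile.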
Top decile profile (decile 10, n = 5,000):

| Metric | Value |
|---|---|
| Actual subscription rate | 38.7% |
| Avg predicted prob | 38.6% |
| Avg total offline spend | $257.35 |
| Avg recency (months) | 7.2 months |
| Avg books purchased | 6.5 books |
| % Female | 41.9% |
| Top genre | Art books |

Bottom decile profile (decile 1, n = 5,000):

| Metric | Value |
|---|---|
| Actual subscription rate | 0.8% |
| Avg predicted prob | 0.65% |
| Avg total offline spend | $204.34 |
| Avg recency (months) | 25.9 months |
| Avg books purchased | 4.2 books |
| % Female | 78.2% |
| Top genre | DIY / Cookbooks |
The model shows strong monotonic lift: predicted deciles reliably rank customers by actual subscription propensity. The top decile (decile 10) achieves a 38.7% actual subscription rate vs. 0.8% in the bottom decile, a 48× spread. This steep gradient indicates that the logistic model discriminates meaningfully between high- and low-probability customers.
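
The 48× figure, and the lift of the top decile over the whole customer base, can be checked with a few lines of arithmetic. The Python sketch below uses the two decile rates quoted above and the subscriber counts from the gender table. Because the decile rates are rounded, the results are approximate.

```python
# Actual subscription rates quoted in the text (rounded).
top_rate = 0.387
bottom_rate = 0.008

# Overall rate derived from the gender table: all subscribers / all customers.
total_subscribers = 2_389 + 2_133
total_customers = 50_000
overall_rate = total_subscribers / total_customers

decile_size = total_customers // 10

spread = top_rate / bottom_rate        # top decile vs. bottom decile
top_lift = top_rate / overall_rate     # top decile vs. random targeting
# Approximate share of all subscribers who sit in the top decile.
share_captured = top_rate * decile_size / total_subscribers

print(f"Top vs. bottom spread: {spread:.1f}x")
print(f"Top-decile lift over baseline: {top_lift:.2f}x")
print(f"Share of subscribers in top decile: {share_captured:.1%}")
```

On these numbers, the top 10% of customers by predicted probability contains roughly four in ten of all subscribers.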
Top decile profile summary: Recent purchasers (7.2 months avg) who buy art books, are disproportionately M/NB (58.1%), and have above-average offline spend ($257). They are active, engaged book buyers who likely gravitate toward visual and experiential content, which is exactly what a visual e-book subscription offers.
Bottom decile profile summary: Lapsed customers (25.9 months since last purchase) who skew female (78.2%) and favor DIY books and cookbooks. They are practical, physical-format readers who find less value in a visual digital subscription.
Based on the model outputs, I designed a targeted email campaign to convert high-probability customers. The campaign has two tiers: a primary outreach to the top two deciles (~10,000 customers) and a reactivation campaign for middle-decile customers who have not purchased recently.
| Metric | Target | Why It Matters |
|---|---|---|
| Email open rate | >22% | Validates subject line and send-time decisions |
| Click-through rate (CTR) | >4% | Measures content relevance and CTA strength |
| Trial activation rate | >15% of clicks | Conversion from click to subscription signup |
| Month-2 retention rate | >60% | Whether trial subscribers stay past the free month |
| Lift vs. control group | >2× | Validates model effectiveness vs. random targeting |
| Revenue per email sent | Track over time | ROI metric to justify ongoing campaign investment |
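
To see what the targets imply in absolute numbers, the funnel can be multiplied through for the primary outreach of about 10,000 emails. This Python sketch treats each target as a minimum threshold, not a forecast. It also assumes the click-through rate is measured per email sent, which the KPI table does not specify. If CTR were measured per opened email, the downstream counts would be smaller.

```python
# Targets copied from the KPI table; each is a minimum threshold, not a forecast.
emails_sent = 10_000            # top two deciles, per the campaign description
click_through_rate = 0.04       # ASSUMPTION: measured per email sent
trial_activation_rate = 0.15    # share of clicks that start a trial
month2_retention_rate = 0.60    # share of trials retained past the free month

clicks = emails_sent * click_through_rate
trials = clicks * trial_activation_rate
retained = trials * month2_retention_rate

print(f"clicks >= {clicks:.0f}, trials >= {trials:.0f}, retained >= {retained:.0f}")
```

Tracking these counts alongside the rates makes it easier to relate the campaign KPIs to revenue per email sent.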