MKT 326 · Assignment 2

Predicting Buying Decisions

Course: MKT 326 (Marketing Analytics)
Data: Florence Books (50,000 customers)
Language: R
R · Logistic Regression · Linear Regression · sqldf · ggplot2 · RFM Analysis · Decile Profiling

Project Overview

Florence Books conducted a 50,000-customer trial of its online visual e-book club and wants to know which customers are most likely to subscribe. Using purchase history from offline stores — including total spend, recency, book genre preferences, and demographics — I built two predictive models: a linear regression to understand offline spending behavior, and a logistic regression to predict subscription probability. I then profiled the top and bottom predicted-subscriber deciles to design a targeted email campaign.

50K Customers in Trial
9.0% Overall Subscription Rate
4,522 Total Subscribers

SQL Descriptive Analysis

I used sqldf in R to run SQL queries directly against the loaded dataframe, replicating the workflow from the course demos. The three core descriptive statistics below establish baselines for the customer population before modeling.

Descriptive Statistics — 50,000 Florence Books Customers
Metric | Mean | Std Dev | Interpretation
Total Offline $ Spent | $208.32 | $101.36 | Wide spending range; high variance signals segmentation opportunity
Total # Books Purchased | 3.89 | 3.48 | Most customers are light buyers; power buyers skew the mean upward
Months Since Last Purchase | 12.36 | 8.15 | Average customer hasn't purchased in ~1 year; recency is a key risk factor

Subscriptions by Gender

Gender | Total Customers | Subscribers | Subscription Rate
Female | 33,302 | 2,389 | 7.2%
Male / Non-Binary | 16,698 | 2,133 | 12.8%
Finding — Gender Gap

Male/NB customers subscribe at nearly 1.8× the rate of female customers (12.8% vs. 7.2%). Female customers make up about two-thirds of the customer base (66.6%) but show lower subscription propensity. The logistic regression confirms this counterintuitive finding and adds context. Part of the gap reflects what male/NB customers buy (more art and geography books, the strongest positive predictors of subscription), but gender is not fully explained away: being female remains a significant negative predictor (OR = 0.47) after controlling for recency, spend, and genre counts.
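The gender breakdown above can be reproduced with a single grouped query. The sketch below is illustrative only: the data frame name `customers` and its eight rows are made up, and only the column names `IsFemale` and `subscribe` come from the model formulas in this write-up. The sqldf call is guarded so the sketch still runs where the package is not installed.

```r
# Toy stand-in for the customer table (hypothetical rows, not the real data).
customers <- data.frame(
  IsFemale  = c(1, 1, 1, 1, 0, 0, 0, 0),
  subscribe = c(0, 0, 0, 1, 0, 1, 1, 0)
)

gender_sql <- "
  SELECT IsFemale,
         COUNT(*)               AS total_customers,
         SUM(subscribe)         AS subscribers,
         100.0 * AVG(subscribe) AS subscription_rate_pct
  FROM customers
  GROUP BY IsFemale
  ORDER BY IsFemale"

if (requireNamespace("sqldf", quietly = TRUE)) {
  # SQL path: sqldf looks up `customers` in the calling environment.
  library(sqldf)
  by_gender <- sqldf(gender_sql)
} else {
  # Base R fallback that computes the same three columns.
  n <- tapply(customers$subscribe, customers$IsFemale, length)
  s <- tapply(customers$subscribe, customers$IsFemale, sum)
  by_gender <- data.frame(
    IsFemale              = as.numeric(names(n)),
    total_customers       = as.vector(n),
    subscribers           = as.vector(s),
    subscription_rate_pct = 100 * as.vector(s / n)
  )
}
print(by_gender)

# Arithmetic check on the published counts from the table above.
published_rates <- round(100 * c(Female = 2389 / 33302, MaleNB = 2133 / 16698), 1)
print(published_rates)
```

Using `AVG(subscribe)` on a 0/1 column gives the subscription rate directly, which avoids integer-division surprises from dividing two counts in SQL.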

Subscription Rate by Gender

Male/NB customers subscribe at 12.8% vs. 7.2% for female customers — a meaningful gap that the logistic model will help explain through book genre preferences and recency differences between groups.


Linear Regression — Predicting Offline Spend

The linear regression model predicts a customer's total offline dollar spend from their gender, tenure (months since first purchase), and book category purchase counts. This establishes which customer attributes are associated with higher-value offline buyers — the foundation for understanding who Florence Books should prioritize for the online club.

total_ ~ IsFemale + first + child + youth + cook + do_it + refernce + art + geog

Linear Regression — Coefficients (Significant Variables)

Each additional book in any category adds approximately $14 to $16 to a customer's total offline spend (coefficients range from $14.46 to $15.66). The effect is remarkably consistent across genres, which suggests that offline spend is driven by purchase volume rather than by any specific genre preference: customers who buy more books in any category spend more. IsFemale and months since first purchase are not statistically significant.

Variable | Coefficient | p-value | Significant? | Interpretation
Intercept | $149.63 | <0.001 | *** | Base spend for a male customer with no book purchases
IsFemale | +$0.68 | 0.433 | No | Gender does not significantly predict offline spend
first (tenure) | −$0.03 | 0.382 | No | Months since first purchase is not a significant spending predictor
Children's Books | +$15.26 | <0.001 | *** | Each children's book adds ~$15 to total spend
Young Adult | +$15.42 | <0.001 | *** | Each YA book adds ~$15 to total spend
Cookbooks | +$15.66 | <0.001 | *** | Each cookbook adds ~$16 to total spend
DIY Books | +$15.01 | <0.001 | *** | Each DIY book adds ~$15 to total spend
Reference | +$14.51 | <0.001 | *** | Each reference book adds ~$14.50 to total spend
Art Books | +$14.46 | <0.001 | *** | Each art book adds ~$14.50 to total spend
Geography/Travel | +$15.18 | <0.001 | *** | Each geography book adds ~$15 to total spend

R² = 0.2656 — the model explains about 27% of variance in offline spend, which is reasonable given we're predicting spending purely from transaction counts without pricing or channel data.
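For readers who want to see how a coefficient table and R² like those above come out of R, here is a minimal sketch using `lm()`. It runs on simulated data, not the Florence Books file: the column names follow the formula in this section, but the data frame `sim`, the sample size, and the generating values (a base of 150 and 15 per book) are assumptions chosen only to make the example run.

```r
set.seed(326)
n <- 2000
genres <- c("child", "youth", "cook", "do_it", "refernce", "art", "geog")

# Simulated stand-in data; only the column names come from the write-up.
sim <- data.frame(
  IsFemale = rbinom(n, 1, 0.67),
  first    = sample(1:60, n, replace = TRUE)
)
for (g in genres) sim[[g]] <- rpois(n, 0.5)

# Made-up spending process: a base amount plus a flat amount per book, plus noise.
sim$total_ <- 150 + 15 * rowSums(sim[genres]) + rnorm(n, sd = 40)

fit_lm <- lm(
  total_ ~ IsFemale + first + child + youth + cook +
    do_it + refernce + art + geog,
  data = sim
)

# Estimates, standard errors, t statistics and p-values, one row per term.
coef_table <- summary(fit_lm)$coefficients
r_squared  <- summary(fit_lm)$r.squared

print(round(coef_table, 3))
print(round(r_squared, 4))
```

The `Pr(>|t|)` column of `coef_table` is where the significance stars in a table like the one above come from.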


Logistic Regression — Predicting Subscription

The logistic regression models the log-odds of subscribing to the online book club as a function of recency (months since last offline purchase), monetary value (total offline spend), gender, and book genre purchase counts. Unlike linear regression, logistic regression outputs a probability between 0 and 1, making it ideal for binary classification like subscription prediction.

subscribe ~ last + total_ + IsFemale + child + youth + cook + do_it + refernce + art + geog
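The formula above maps directly onto `glm()` with a binomial family. The following is a sketch on simulated data, not the author's actual script: the data frame `sim`, the sample size, and the "true" log-odds used to generate outcomes are all hypothetical, while the column names are taken from the formula.

```r
set.seed(326)
n <- 5000
genres <- c("child", "youth", "cook", "do_it", "refernce", "art", "geog")

# Simulated stand-in data; column names follow the formula above.
sim <- data.frame(
  last     = sample(1:36, n, replace = TRUE),
  total_   = round(runif(n, 50, 400), 2),
  IsFemale = rbinom(n, 1, 0.67)
)
for (g in genres) sim[[g]] <- rpois(n, 0.5)

# Made-up "true" log-odds, used only to generate toy outcomes.
eta <- -0.5 - 0.10 * sim$last - 0.8 * sim$IsFemale +
  1.1 * sim$art + 0.6 * sim$geog - 0.5 * sim$do_it
sim$subscribe <- rbinom(n, 1, plogis(eta))

fit_logit <- glm(
  subscribe ~ last + total_ + IsFemale + child + youth + cook +
    do_it + refernce + art + geog,
  family = binomial(link = "logit"),
  data   = sim
)

# type = "response" returns probabilities; the default returns log-odds.
sim$p_hat <- predict(fit_logit, type = "response")
print(summary(sim$p_hat))
```

The predicted probabilities in `p_hat` are the quantity that a decile ranking would later sort on.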

Logistic Regression — Odds Ratios

An odds ratio above 1.0 means the variable raises the odds of subscribing; a ratio below 1.0 means it lowers them. Art books (OR = 3.18) and geography/travel books (OR = 1.78) are the strongest positive predictors. DIY books (OR = 0.58) and being female (OR = 0.47) are the strongest negative predictors. All variables are highly significant (p < 0.001).

Variable | Coefficient | Odds Ratio | p-value | Interpretation
last (recency) | −0.095 | 0.910 | <0.001 | Each additional month since last purchase reduces subscription odds by ~9%; recency matters
total_ (spend) | +0.001 | 1.001 | <0.001 | Each additional dollar of offline spend raises subscription odds by ~0.1%; higher spenders are marginally more likely to subscribe
IsFemale | −0.761 | 0.467 | <0.001 | Female customers have 53% lower subscription odds than M/NB customers, controlling for other factors
Children's Books | −0.186 | 0.830 | <0.001 | Each children's book reduces subscription odds by 17%
Young Adult | −0.113 | 0.893 | <0.001 | Each YA book reduces odds by 11%
Cookbooks | −0.270 | 0.763 | <0.001 | Each cookbook reduces odds by 24%
DIY Books | −0.539 | 0.583 | <0.001 | Each DIY book reduces odds by 42%; strongest negative genre signal
Reference | +0.235 | 1.265 | <0.001 | Each reference book increases odds by 26%
Art Books | +1.156 | 3.176 | <0.001 | Each art book more than triples subscription odds; strongest positive signal
Geography/Travel | +0.574 | 1.776 | <0.001 | Each geography book increases odds by 78%
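The odds ratios in the table are the exponentiated coefficients, and the percentage figures are (OR − 1) × 100. The short sketch below redoes that arithmetic from the rounded coefficients as printed, so the third decimal can differ slightly from the table. The second half is an illustration under an assumed baseline: it applies the IsFemale odds ratio to the 9.0% overall rate to show that a change in odds is not the same as a change in probability.

```r
# Coefficients (log-odds) copied from the table above, as rounded there.
coefs <- c(last = -0.095, total_ = 0.001, IsFemale = -0.761,
           child = -0.186, youth = -0.113, cook = -0.270,
           do_it = -0.539, refernce = 0.235, art = 1.156, geog = 0.574)

odds_ratio      <- exp(coefs)
pct_change_odds <- 100 * (odds_ratio - 1)
print(round(cbind(odds_ratio, pct_change_odds), 3))

# Illustration only: treat the 9.0% overall rate as a baseline probability.
base_p    <- 0.09
base_odds <- base_p / (1 - base_p)

# An odds ratio multiplies the odds; convert back to get a probability.
new_odds <- base_odds * odds_ratio[["IsFemale"]]
new_p    <- new_odds / (1 + new_odds)
print(round(c(base_p = base_p, new_p = new_p), 4))
```

From that assumed 9.0% baseline, an odds ratio of about 0.47 lowers the probability to roughly 4.4%, which is a somewhat smaller relative drop than the 53% reduction in odds.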
Key Logistic Regression Insights

The model tells a clear story: the ideal e-book club subscriber is a recent M/NB customer who buys art and geography books offline. Art books carry a remarkably strong signal (OR = 3.18), possibly because visual and artistic content translates especially well to the visual e-book format. DIY and cookbook buyers, by contrast, may prefer physical books for practical reference use, so a digital subscription has less utility for them. Recency is a critical lever: each additional month since the last purchase lowers subscription odds by about 9%.


Decile Profiling — Top vs. Bottom Predicted Subscribers

Using the logistic model's predicted probabilities, I ranked all 50,000 customers into 10 equal deciles (5,000 each). I then profiled the top decile (highest predicted subscription probability) and bottom decile (lowest) to understand who to target and who to deprioritize in a promotional campaign.

4.28× Top Decile Lift
38.7% Top Decile Actual Sub Rate
9.0% Overall Subscription Rate
Top Decile — Target Segment
Actual subscription rate (n=5,000): 38.7%
Avg predicted prob: 38.6%
Avg total offline spend: $257.35
Avg recency (months): 7.2 months
Avg books purchased: 6.5 books
% Female: 41.9%
Top genre: Art books
Bottom Decile — Do Not Target
Actual subscription rate (n=5,000): 0.8%
Avg predicted prob: 0.65%
Avg total offline spend: $204.34
Avg recency (months): 25.9 months
Avg books purchased: 4.2 books
% Female: 78.2%
Top genre: DIY / Cookbooks
Actual Subscription Rate by Predicted Decile

The model shows strong monotonic lift — predicted deciles reliably rank customers by actual subscription propensity. The top decile (decile 10) achieves 38.7% actual subscription vs. 0.8% in the bottom decile, a 48× spread. This steep gradient validates that the logistic model is meaningfully discriminating between high and low probability subscribers.
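The decile ranking and lift figures can be sketched in a few lines of base R. The predicted probabilities and outcomes below are simulated (a Beta(1, 9) draw is an arbitrary assumption, as are the names `p_hat` and `subscribe` for these vectors); only the final arithmetic check uses numbers reported in this write-up.

```r
set.seed(326)
n <- 10000

# Hypothetical predicted probabilities and outcomes, for illustration only.
p_hat     <- rbeta(n, 1, 9)
subscribe <- rbinom(n, 1, p_hat)

# Rank-based split into 10 equal groups; decile 10 holds the highest p_hat.
decile <- ceiling(10 * rank(p_hat, ties.method = "first") / n)

rate_by_decile <- tapply(subscribe, decile, mean)
overall_rate   <- mean(subscribe)
lift           <- rate_by_decile / overall_rate
print(round(cbind(rate = rate_by_decile, lift = lift), 3))

# Arithmetic check on the reported figures: 38.7% top decile,
# 0.8% bottom decile, 4,522 subscribers out of 50,000.
top_lift        <- 38.7 / (100 * 4522 / 50000)
top_vs_bottom   <- 38.7 / 0.8
print(round(c(top_lift = top_lift, top_vs_bottom = top_vs_bottom), 2))
```

Ranking with `ties.method = "first"` keeps every decile the same size even when several customers share a predicted probability.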

Top decile profile summary: Recent purchasers (7.2 months avg) who buy art books, are disproportionately M/NB (58.1%), and have above-average offline spend ($257). They are active, engaged book buyers who likely gravitate toward visual and experiential content, which is exactly what a visual e-book subscription offers.

Bottom decile profile: Lapsed customers (25.9 months since last purchase) who skew female (78.2%) and favor DIY and cookbooks. These are practical, physical-format readers who likely find less value in a visual digital subscription.


Email Campaign Design

Based on the model outputs, I designed a targeted email campaign to convert high-probability customers. The campaign has two tiers: a primary outreach to the top two deciles (~10,000 customers) and a reactivation campaign for middle deciles who haven't purchased recently.

Tier 1 — Top Two Deciles (Highest-Probability Subscribers)

Tier 2 — Middle Deciles (Recency-Recoverable Customers)

Marketing Metrics to Track

Metric | Target | Why It Matters
Email open rate | >22% | Validates subject line and send-time decisions
Click-through rate (CTR) | >4% | Measures content relevance and CTA strength
Trial activation rate | >15% of clicks | Conversion from click to subscription signup
Month-2 retention rate | >60% | Whether trial subscribers stay past the free month
Lift vs. control group | >2× | Validates model effectiveness vs. random targeting
Revenue per email sent | Track over time | ROI metric to justify ongoing campaign investment
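As a rough illustration of how the rate metrics in the table could be computed once campaign counts are available, the sketch below uses entirely hypothetical numbers. It also assumes CTR is measured as clicks per email sent and that the control group is a random sample of recipients; neither definition is fixed by this write-up.

```r
# Hypothetical campaign counts, invented purely for illustration.
targeted <- c(sent = 10000, opened = 2400, clicked = 450,
              activated = 80, retained_month2 = 52)
control  <- c(sent = 10000, activated = 30)  # randomly selected recipients

metrics <- c(
  open_rate        = targeted[["opened"]] / targeted[["sent"]],
  ctr              = targeted[["clicked"]] / targeted[["sent"]],
  activation_rate  = targeted[["activated"]] / targeted[["clicked"]],
  month2_retention = targeted[["retained_month2"]] / targeted[["activated"]]
)

# Lift vs. control: activations per email sent, targeted over random.
lift_vs_control <- (targeted[["activated"]] / targeted[["sent"]]) /
  (control[["activated"]] / control[["sent"]])

print(round(100 * metrics, 1))
print(round(lift_vs_control, 2))
```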