MKT 326 · Assignment 2

Predicting Buying Decisions

Course: MKT 326, Marketing Analytics

Data: Florence Books, 50,000 Customers

Language: R

R · Logistic Regression · Linear Regression · sqldf · ggplot2 · RFM Analysis · Decile Profiling

Project Overview

Florence Books conducted a 50,000-customer trial of its online visual e-book club and wants to know which customers are most likely to subscribe. Using purchase history from offline stores — including total spend, recency, book genre preferences, and demographics — I built two predictive models: a linear regression to understand offline spending behavior, and a logistic regression to predict subscription probability. I then profiled the top and bottom predicted-subscriber deciles to design a targeted email campaign.

50K Customers in Trial

9.0% Overall Subscription Rate

4,522 Total Subscribers

SQL Descriptive Analysis

I used sqldf in R to run SQL queries directly against the loaded dataframe, replicating the workflow from the course demos. The three core descriptive statistics below establish baselines for the customer population before modeling.

Descriptive Statistics — 50,000 Florence Books Customers

Metric	Mean	Std Dev	Interpretation
Total Offline $ Spent	$208.32	$101.36	Wide spending range; high variance signals segmentation opportunity
Total # Books Purchased	3.89	3.48	Most customers are light buyers; power buyers skew the mean upward
Months Since Last Purchase	12.36	8.15	Average customer hasn't purchased in ~1 year; recency is a key risk factor

Subscriptions by Gender

Gender	Total Customers	Subscribers	Subscription Rate
Female	33,302	2,389	7.2%
Male / Non-Binary	16,698	2,133	12.8%
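
A breakdown like the one above can be produced with a single grouped query. The following is a minimal sketch, not the original assignment code: the data frame name `customers` and its toy rows are assumptions for illustration, while `IsFemale` and `subscribe` are the variable names used in the model formulas in this write-up.

```r
library(sqldf)

# Toy stand-in for the 50,000-row customer table (values are illustrative only).
customers <- data.frame(
  IsFemale  = c(1, 1, 1, 1, 0, 0, 0, 0),
  subscribe = c(1, 0, 0, 0, 1, 1, 0, 0)
)

# sqldf runs the SQL against the data frame named in the FROM clause.
by_gender <- sqldf("
  SELECT IsFemale,
         COUNT(*)                          AS total_customers,
         SUM(subscribe)                    AS subscribers,
         100.0 * SUM(subscribe) / COUNT(*) AS sub_rate_pct
  FROM customers
  GROUP BY IsFemale
  ORDER BY IsFemale
")
print(by_gender)
```

Multiplying by `100.0` rather than `100` keeps the division in floating point, so the rate is not truncated to a whole number.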

Finding — Gender Gap

Male/NB customers subscribe at nearly 1.8× the rate of female customers (12.8% vs. 7.2%). Female customers make up roughly two-thirds of the customer base (66.6%) but show lower subscription propensity. The logistic regression confirms the gap: being female remains a strong negative predictor (OR = 0.47) after controlling for recency, spend, and genre counts. Genre mix is still part of the story, since art and geography purchases are the strongest positive predictors of subscription, but it does not fully account for the difference.

Subscription Rate by Gender

Male/NB customers subscribe at 12.8% vs. 7.2% for female customers — a meaningful gap that the logistic model will help explain through book genre preferences and recency differences between groups.

Linear Regression — Predicting Offline Spend

The linear regression model predicts a customer's total offline dollar spend from their gender, tenure (months since first purchase), and book category purchase counts. This establishes which customer attributes are associated with higher-value offline buyers — the foundation for understanding who Florence Books should prioritize for the online club.

total_ ~ IsFemale + first + child + youth + cook + do_it + refernce + art + geog

Linear Regression — Coefficients (Significant Variables)

Every book category adds approximately $14–16 to a customer's total offline spend. The effect is remarkably consistent across genres, suggesting that the key driver of offline spend is volume of purchases — customers who buy more books across any category spend more — rather than any specific genre preference. IsFemale and months-since-first-purchase are statistically insignificant.

Variable	Coefficient	p-value	Significant?	Interpretation
Intercept	$149.63	<0.001	***	Baseline spend for a male/NB customer with zero tenure and no purchases in the listed categories
IsFemale	+$0.68	0.433	No	Gender does not significantly predict offline spend
first (tenure)	−$0.03	0.382	No	Months since first purchase is not a significant spending predictor
Children's Books	+$15.26	<0.001	***	Each children's book adds ~$15 to total spend
Young Adult	+$15.42	<0.001	***	Each YA book adds ~$15 to total spend
Cookbooks	+$15.66	<0.001	***	Each cookbook adds ~$16 to total spend
DIY Books	+$15.01	<0.001	***	Each DIY book adds ~$15 to total spend
Reference	+$14.51	<0.001	***	Each reference book adds ~$14.50 to total spend
Art Books	+$14.46	<0.001	***	Each art book adds ~$14.50 to total spend
Geography/Travel	+$15.18	<0.001	***	Each geography book adds ~$15 to total spend

R² = 0.2656 — the model explains about 27% of variance in offline spend, which is reasonable given we're predicting spending purely from transaction counts without pricing or channel data.
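
To make the coefficient interpretation concrete, here is a small worked sketch that plugs the published (rounded) coefficients into the linear equation for one hypothetical customer. The helper `predict_spend` and the example profile are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the table above (rounded as published).
coefs <- c(intercept = 149.63, IsFemale = 0.68, first = -0.03,
           child = 15.26, youth = 15.42, cook = 15.66, do_it = 15.01,
           refernce = 14.51, art = 14.46, geog = 15.18)

# Predicted total offline spend for one customer profile.
# Any predictor not supplied in `profile` is treated as 0.
predict_spend <- function(profile) {
  x <- setNames(numeric(length(coefs)), names(coefs))
  x["intercept"] <- 1
  x[names(profile)] <- profile
  sum(coefs * x)
}

# Hypothetical customer: male/NB, first purchase 20 months ago,
# two art books and one cookbook.
example_spend <- predict_spend(c(first = 20, art = 2, cook = 1))
print(round(example_spend, 2))  # 149.63 - 0.60 + 28.92 + 15.66 = 193.61
```

With a fitted `lm` object, `predict()` on a new data frame would do the same calculation using the unrounded coefficients.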

Logistic Regression — Predicting Subscription

The logistic regression models the log-odds of subscribing to the online book club as a function of recency (months since last offline purchase), monetary value (total offline spend), gender, and book genre purchase counts. Unlike linear regression, logistic regression outputs a probability between 0 and 1, making it ideal for binary classification like subscription prediction.

subscribe ~ last + total_ + IsFemale + child + youth + cook + do_it + refernce + art + geog

Logistic Regression — Odds Ratios

An odds ratio above 1.0 means each additional unit of the variable raises the odds of subscribing; below 1.0 means it lowers them. Art books (OR = 3.18) and geography/travel books (OR = 1.78) are the strongest positive predictors. DIY books (OR = 0.58) and being female (OR = 0.47) are the strongest negative predictors. All variables are highly significant (p < 0.001).

Variable	Coefficient	Odds Ratio	p-value	Interpretation
last (recency)	−0.095	0.910	<0.001	Each additional month since last purchase reduces subscription odds by ~9% — recency matters
total_ (spend)	+0.001	1.001	<0.001	Each additional dollar of offline spend raises subscription odds by ~0.1%
IsFemale	−0.761	0.467	<0.001	Female customers have 53% lower subscription odds than M/NB customers, controlling for other factors
Children's Books	−0.186	0.830	<0.001	Each children's book reduces subscription odds by 17%
Young Adult	−0.113	0.893	<0.001	Each YA book reduces odds by 11%
Cookbooks	−0.270	0.763	<0.001	Each cookbook reduces odds by 24%
DIY Books	−0.539	0.583	<0.001	Each DIY book reduces odds by 42% — strongest negative genre signal
Reference	+0.235	1.265	<0.001	Each reference book increases odds by 26%
Art Books	+1.156	3.176	<0.001	Each art book more than triples subscription odds — strongest positive signal
Geography/Travel	+0.574	1.776	<0.001	Each geography book increases odds by 78%
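
The odds-ratio column is the exponential of the coefficient column, and the percentages in the interpretations are the odds ratio minus one. The sketch below reproduces that conversion from the rounded coefficients in the table; third-decimal differences from the published odds ratios are expected because the coefficients shown are themselves rounded.

```r
# Logit coefficients copied from the table above (rounded as published).
logit_coefs <- c(last = -0.095, total_ = 0.001, IsFemale = -0.761,
                 child = -0.186, youth = -0.113, cook = -0.270,
                 do_it = -0.539, refernce = 0.235, art = 1.156, geog = 0.574)

odds_ratio <- exp(logit_coefs)        # multiplicative change in odds per unit
pct_change <- 100 * (odds_ratio - 1)  # the same effect as a percent change in odds

print(round(cbind(odds_ratio, pct_change), 2))
```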

Key Logistic Regression Insights

The model tells a clear story: the ideal e-book club subscriber is a recent M/NB customer who buys art and geography books offline. Art books carry a remarkably strong signal (OR = 3.18), possibly because visual and artistic content translates especially well to the visual e-book format. DIY and cookbook buyers, by contrast, likely prefer physical books for practical reference use, so a digital subscription has less utility for them. Recency is a critical lever: each additional month of inactivity lowers subscription odds by about 9%.
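
Because the model is linear in the log-odds, per-unit effects compound multiplicatively. The sketch below uses the published recency and art-book coefficients to show what several months of inactivity or several art books imply for the odds. The helper name `odds_multiplier` is an illustrative assumption. Note that these are multipliers on odds, not probabilities: the intercept is not reported in the table, so absolute probabilities cannot be reconstructed from it.

```r
b_last <- -0.095  # per-month logit coefficient for recency
b_art  <-  1.156  # per-book logit coefficient for art books

# Multiplier applied to the odds for k additional units of a predictor.
odds_multiplier <- function(beta, k) exp(beta * k)

six_months_lapsed <- odds_multiplier(b_last, 6)
two_art_books     <- odds_multiplier(b_art, 2)

print(round(c(six_months_lapsed = six_months_lapsed,
              two_art_books     = two_art_books), 2))
```

Under these coefficients, six extra months of inactivity cuts the odds by roughly 43%, while two art books raise them about tenfold, holding everything else fixed.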

Decile Profiling — Top vs. Bottom Predicted Subscribers

Using the logistic model's predicted probabilities, I ranked all 50,000 customers and split them into ten deciles of 5,000 customers each. I then profiled the top decile (highest predicted subscription probability) and the bottom decile (lowest) to understand who to target and who to deprioritize in a promotional campaign.
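
One way to form equal-sized deciles in base R is to rank the predicted probabilities and cut the ranks into ten groups. This is a sketch under assumptions: `pred_prob` here is random toy data standing in for the fitted probabilities, and the original analysis may have used a different helper.

```r
set.seed(1)
# Toy stand-in for the fitted probabilities; in practice these would come from
# predict(model, type = "response") on the fitted glm.
pred_prob <- runif(50)

# Rank from lowest to highest probability, then cut the ranks into ten
# equal-sized groups. ties.method = "first" keeps group sizes exactly equal.
decile <- ceiling(10 * rank(pred_prob, ties.method = "first") / length(pred_prob))

print(table(decile))                     # customers per decile
print(tapply(pred_prob, decile, mean))   # mean predicted probability per decile
```

With this numbering, decile 10 holds the highest predicted probabilities and decile 1 the lowest.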

4.28× Top Decile Lift

38.7% Top Decile Actual Sub Rate

9.0% Overall Subscription Rate

Top Decile — Target Segment

38.7% actual subscription rate (n=5,000)

Avg predicted prob	38.6%
Avg total offline spend	$257.35
Avg recency (months)	7.2 months
Avg books purchased	6.5 books
% Female	41.9%
Top genre	Art books

Bottom Decile — Do Not Target

0.8% actual subscription rate (n=5,000)

Avg predicted prob	0.65%
Avg total offline spend	$204.34
Avg recency (months)	25.9 months
Avg books purchased	4.2 books
% Female	78.2%
Top genre	DIY / Cookbooks

Actual Subscription Rate by Predicted Decile

The model shows strong monotonic lift — predicted deciles reliably rank customers by actual subscription propensity. The top decile (decile 10) achieves 38.7% actual subscription vs. 0.8% in the bottom decile, a 48× spread. This steep gradient validates that the logistic model is meaningfully discriminating between high and low probability subscribers.
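
The lift and spread figures follow directly from the reported rates. A short check of the arithmetic, using only numbers stated in this write-up:

```r
overall_rate <- 4522 / 50000  # total subscribers / customers in trial
top_rate     <- 0.387         # actual subscription rate in decile 10
bottom_rate  <- 0.008         # actual subscription rate in decile 1

top_lift <- top_rate / overall_rate  # top decile relative to the overall rate
spread   <- top_rate / bottom_rate   # top decile relative to the bottom decile

print(round(c(top_lift = top_lift, spread = spread), 2))
```

Lift is measured against the overall rate (about 4.28×), while the spread compares the two extreme deciles (about 48×).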

Top decile profile summary: Recent purchasers (7.2 months avg) who buy art books, are disproportionately M/NB (58.1%), and have above-average offline spend ($257). They are active, engaged book buyers who likely gravitate toward visual and experiential content, which is exactly what a visual e-book subscription offers.

Bottom decile profile summary: Lapsed customers (25.9 months since last purchase) who skew female (78.2%) and favor DIY books and cookbooks. These are practical, physical-format readers who likely find less value in a visual digital subscription.

Email Campaign Design

Based on the model outputs, I designed a targeted email campaign to convert high-probability customers. The campaign has two tiers: a primary outreach to the top two deciles (~10,000 customers) and a reactivation campaign for middle-decile customers whose last purchase is less recent.

Tier 1 — Top Two Deciles (Highest-Probability Subscribers)

Top 2 deciles (deciles 9–10): a pool of ~10,000 customers, narrowed by three filters: predicted probability > 25%, last purchase within 18 months, and at least one art or geography book purchased.
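
The Tier 1 audience definition translates into a row filter on the scored customer table. The sketch below assumes a data frame `scored` with hypothetical columns `decile` and `pred_prob` (from the logistic model and decile ranking) alongside the `last`, `art`, and `geog` variables from the model formula. The five rows are toy data.

```r
# Toy scored customers; values are illustrative only.
scored <- data.frame(
  id        = 1:5,
  decile    = c(10, 10, 9, 9, 8),
  pred_prob = c(0.45, 0.30, 0.27, 0.20, 0.26),
  last      = c(4, 20, 12, 6, 3),   # months since last purchase
  art       = c(2, 1, 0, 1, 1),
  geog      = c(0, 0, 1, 0, 0)
)

# Tier 1: top two deciles, probability above 25%, purchased within
# 18 months, and at least one art or geography book.
tier1 <- subset(
  scored,
  decile >= 9 &
    pred_prob > 0.25 &
    last <= 18 &
    (art > 0 | geog > 0)
)
print(tier1$id)
```

In this toy table only customers 1 and 3 pass every filter: customer 2 fails the recency rule, customer 4 the probability threshold, and customer 5 the decile cut.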

Subject Line

"The books you love — now in a format made for them."
A/B test alternative: "2 visual e-books, every month. On us — first month free."

Content

Lead with a curated art book or illustrated travel title relevant to their purchase history (personalized by genre). Show a preview of the visual reading experience — the format is the product differentiator. Include a single clear CTA: "Start my free month." Social proof: "Join X readers already exploring our visual collection." Keep it under 200 words — this audience already knows Florence Books; they need a reason to try digital, not a brand introduction.

Timing

Send Tuesday or Wednesday evening, a window often cited for strong retail email open rates (worth confirming with a send-time test). Since calendar purchase timing is unavailable, use relationship timing: send 3–4 weeks after a customer's most recent offline purchase, when brand recall is likely still high. For lapsed customers (6–18 months since last purchase), send on the anniversary of their last purchase date to trigger recall.

Offer

First month free, then $X/month. The barrier is trial, not price: art and geography buyers have above-average spend profiles, so a free trial is likely to be more effective than a discount for this segment.

Tier 2 — Middle Deciles (Recency-Recoverable Customers)

Deciles 5–8 — customers with moderate predicted probability (8–25%) who have purchased within 24 months. Focus on those with reference or art book purchases.