MKT 326 · Assignment 2

Buying Decisions — R Code

Rsqldfglm lmggplot2dplyr
Analysis R Code

Setup & Data Import

Rsetup.R
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

library(sqldf)
library(ggplot2)
library(dplyr)

setwd("~/Desktop/MKTAnalytics/Data")
df <- read.csv("OnlineBookClub.csv")

str(df)
head(df)

SQL Descriptive Statistics

Rsql_descriptive.R
# Means via SQL — sqldf runs SQLite which lacks STDEV(),
# so we use R's sd() for standard deviations
sqldf("SELECT AVG(total_) AS AvgTotalSpend FROM df")
cat("SD total spend: ", sd(df$total_), "\n")

sqldf("SELECT AVG(purch) AS AvgBooksPurchased FROM df")
cat("SD books purchased: ", sd(df$purch), "\n")

sqldf("SELECT AVG(last) AS AvgRecency FROM df")
cat("SD recency: ", sd(df$last), "\n")

# 4) Subscriptions by gender
sqldf("
  SELECT gender,
         COUNT(*)                                               AS TotalCustomers,
         SUM(CASE WHEN subscribe = 'yes' THEN 1 ELSE 0 END)   AS Subscribers,
         CAST(SUM(CASE WHEN subscribe = 'yes' THEN 1 ELSE 0 END) AS FLOAT)
           / COUNT(*)                                          AS SubRate
  FROM df
  GROUP BY gender
")
Output
AvgTotalSpend 1 208.32 SD total spend: 101.36 AvgBooksPurchased 1 3.89 SD books purchased: 3.48 AvgRecency 1 12.36 SD recency: 8.15 gender TotalCustomers Subscribers SubRate 1 F 33302 2389 0.072 2 M/NB 16698 2133 0.128

Linear Regression — Predicting Offline Spend

Dependent variable: total_ (total offline dollars spent). Independent variables: IsFemale, months since first purchase, and all book category counts.

Rlinear_regression.R
# Create binary gender indicator
df$IsFemale <- ifelse(df$gender == "F", 1, 0)

# Fit linear model
lin_model <- lm(total_ ~ IsFemale + first +
                  child + youth + cook + do_it +
                  refernce + art + geog,
                data = df)

summary(lin_model)
Output — summary(lin_model)
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 149.6275 0.9513 157.36 <2e-16 *** IsFemale 0.6786 0.8650 0.78 0.433 first -0.0335 0.0384 -0.87 0.382 child 15.2642 0.4216 36.21 <2e-16 *** youth 15.4177 0.6267 24.60 <2e-16 *** cook 15.6558 0.4059 38.59 <2e-16 *** do_it 15.0080 0.5878 25.53 <2e-16 *** refernce 14.5116 0.6985 20.78 <2e-16 *** art 14.4600 0.6303 22.94 <2e-16 *** geog 15.1818 0.5284 28.73 <2e-16 *** --- Signif. codes: 0 '***' Residual standard error: 87.6 Multiple R-squared: 0.2656 F-statistic: 2012 on 9 and 49990 DF, p-value: < 2.2e-16

Logistic Regression — Predicting Subscription

Dependent variable: subscribe (yes/no → binary 0/1). Independent variables: recency, monetary, IsFemale, and all book category counts.

Rlogistic_regression.R
# Create binary outcome
df$subscribe_bin <- ifelse(df$subscribe == "yes", 1, 0)

# Fit logistic model (family = binomial → logistic regression)
log_model <- glm(subscribe_bin ~ last + total_ + IsFemale +
                   child + youth + cook + do_it +
                   refernce + art + geog,
                 data = df,
                 family = binomial())

summary(log_model)

# Exponentiate coefficients to get odds ratios
exp(coef(log_model))
Output — Coefficients & Odds Ratios
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.6001 0.0521 -30.71 <2e-16 *** last -0.0947 0.0028 -33.92 <2e-16 *** total_ 0.0011 0.0002 5.63 <2e-16 *** IsFemale -0.7607 0.0358 -21.27 <2e-16 *** child -0.1862 0.0173 -10.78 <2e-16 *** youth -0.1130 0.0261 -4.33 <2e-16 *** cook -0.2703 0.0171 -15.78 <2e-16 *** do_it -0.5392 0.0270 -19.99 <2e-16 *** refernce 0.2347 0.0266 8.84 <2e-16 *** art 1.1556 0.0221 52.19 <2e-16 *** geog 0.5743 0.0186 30.82 <2e-16 *** Odds Ratios: (Intercept) last total_ IsFemale child youth cook do_it refernce art geog 0.2019 0.9096 1.0011 0.4673 0.8301 0.8932 0.7631 0.5832 1.2645 3.1759 1.7758

Decile Profiling

Rdecile_analysis.R
# Predict probability of subscribing for each customer
df$pred_prob <- predict(log_model, type = "response")

# Assign each customer to a decile (1 = lowest prob, 10 = highest prob)
df$decile <- ntile(df$pred_prob, 10)   # dplyr::ntile

# Profile each decile
decile_profile <- sqldf("
  SELECT decile,
         COUNT(*)                                                AS Customers,
         AVG(pred_prob)                                         AS AvgPredProb,
         SUM(subscribe_bin)                                     AS ActualSubscribers,
         CAST(SUM(subscribe_bin) AS FLOAT) / COUNT(*)           AS ActualSubRate,
         AVG(total_)                                            AS AvgSpend,
         AVG(last)                                              AS AvgRecency,
         AVG(purch)                                             AS AvgBooks,
         AVG(IsFemale)                                          AS PctFemale
  FROM df
  GROUP BY decile
  ORDER BY decile
")

print(decile_profile)

# Highlight top and bottom decile
top_decile <- df[df$decile == 10, ]
bot_decile <- df[df$decile == 1,  ]

cat("Top decile actual sub rate:", mean(top_decile$subscribe_bin), "\n")
cat("Bot decile actual sub rate:", mean(bot_decile$subscribe_bin), "\n")
cat("Lift (top vs overall):", mean(top_decile$subscribe_bin) / mean(df$subscribe_bin), "\n")

# Lift chart
ggplot(decile_profile, aes(x = decile, y = ActualSubRate * 100)) +
  geom_col(fill = "#1a1a2e") +
  geom_hline(aes(yintercept = mean(df$subscribe_bin) * 100),
             color = "#e94560", linetype = "dashed") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Actual Subscription Rate by Predicted Decile",
       x = "Decile (1 = lowest, 10 = highest predicted)",
       y = "Actual Sub Rate (%)")
Output — Key Deciles
decile Customers AvgPredProb ActualSubRate AvgSpend AvgRecency AvgBooks PctFemale 1 1 5000 0.0065 0.008 204.34 25.9 4.2 0.782 ... 10 10 5000 0.3856 0.387 257.35 7.2 6.5 0.419 Top decile actual sub rate: 0.387 Bot decile actual sub rate: 0.008 Lift (top vs overall): 4.28
← back to analysis