Prediction of Buying Behavior

Analyzing data for potential insight to inform a “free-to-fee” strategy. Looking into factors that affect users’ decisions to pay for a premium subscription. Findings quantified the effect of social engagement on revenue, as well as how valuable a premium subscriber can be in a freemium social community.

Data file

INDEX:

1: Descriptive statistics
2: Visualizations
3: Propensity Score Matching (PSM)
4: Regression analysis
5: Conclusion

Importing libraries

In [47]:
library("dplyr")
library("psych")
library("purrr")
library("ggplot2")
library("gapminder")
library("gganimate")
library("MatchIt")
library("gridExtra")

Overview of dataset

In [3]:
data <- read.csv("HighNote Data.csv")
head(data)
IDagemalefriend_cntavg_friend_ageavg_friend_malefriend_country_cntsubscriber_friend_cntsongsListenedlovedTrackspostsplaylistsshoutsadoptertenuregood_country
1 22 0 8 22.57143 0.42857141 0 9687 194 0 1 8 0 59 1
2 35 0 2 28.00000 1.00000002 0 0 0 0 0 0 0 35 0
3 27 1 2 23.00000 1.00000001 0 508 0 0 1 2 0 42 0
4 21 0 28 22.94737 0.50000007 1 1357 32 0 0 1 0 25 0
5 24 0 65 22.28302 0.91379319 0 89984 20 2 0 81 0 67 0
6 21 1 12 25.00000 0.77777781 0 124547 10 0 1 2 0 53 1
In [4]:
# No categorical variables in the dataset
glimpse(data)
Observations: 43,827
Variables: 16
$ ID                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
$ age                   <int> 22, 35, 27, 21, 24, 21, 20, 23, 24, 34, 20, 2...
$ male                  <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...
$ friend_cnt            <int> 8, 2, 2, 28, 65, 12, 15, 57, 4, 13, 18, 2, 3,...
$ avg_friend_age        <dbl> 22.57143, 28.00000, 23.00000, 22.94737, 22.28...
$ avg_friend_male       <dbl> 0.4285714, 1.0000000, 1.0000000, 0.5000000, 0...
$ friend_country_cnt    <int> 1, 2, 1, 7, 9, 1, 1, 14, 1, 3, 11, 1, 2, 2, 8...
$ subscriber_friend_cnt <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
$ songsListened         <int> 9687, 0, 508, 1357, 89984, 124547, 24852, 998...
$ lovedTracks           <int> 194, 0, 0, 32, 20, 10, 391, 125, 42, 82, 15, ...
$ posts                 <int> 0, 0, 0, 0, 2, 0, 6, 89, 0, 0, 4, 0, 0, 0, 0,...
$ playlists             <int> 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ shouts                <int> 8, 0, 2, 1, 81, 2, 67, 44, 5, 3, 18, 3, 3, 10...
$ adopter               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ tenure                <int> 59, 35, 42, 25, 67, 53, 56, 71, 34, 49, 30, 4...
$ good_country          <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, ...

Descriptive statistics

In [5]:
#Descriptive statistics for key variables split by adopter
# Descriptive statistics for adopter and nonadpoter have different outputs. The mean values of the key variables for 
# adopter sample  are higher in comparison to the ones of nonadopter sample. The variance of values is generally higher in
# adopter sample, therefore the kurtosis is lower than in nonadopter sample, which means the distribution is less 
# sharpenned. Adopter mean is greater than adopter median which indicates that the disctribution is skewed to the right. 
# The same refers to nonadapter sample. The samples  have the same min values. Maximum values are higher in nonadopter 
# sample which laso explains their higher range. 
# With all said, we can conclude some facts about the samples: 
# Premium suscribers (adopters) are slighly older than free users (nonadopters) in its majority, with more male 
# population than nonadopters, they have significantly more friends both free and premium users. They are much more 
# engaged with the Highnote content. Premium subscribers also have longer experience with Highnote and its population is
# less US, UK and Germany users as it is observed among free users.
keydata<-subset(data, select=-c(ID, avg_friend_age,avg_friend_male,friend_country_cnt))
keydata %>% split(.$adopter) %>% map(describe)
$`0`
varsnmeansdmediantrimmedmadminmaxrangeskewkurtosisse
age 1 40300 2.394844e+016.371831e+00 23 2.309355e+01 4.4478 8 79 71 1.96742309 6.79540293.174035e-02
male 2 40300 6.218610e-014.849286e-01 1 6.523263e-01 0.0000 0 1 1 -0.50258131 -1.74745542.415601e-03
friend_cnt 3 40300 1.849166e+015.748117e+01 7 1.028074e+01 7.4130 1 4957 4956 32.67366911 2087.42440812.863341e-01
subscriber_friend_cnt 4 40300 4.174690e-012.418151e+00 0 1.255893e-01 0.0000 0 309 309 72.19357997 8024.61706601.204567e-02
songsListened 5 40300 1.758944e+042.841602e+047440 1.181764e+0410576.8684 0 1000000 1000000 6.04954176 105.84596371.415503e+02
lovedTracks 6 40300 8.682263e+012.635804e+02 14 3.635102e+01 20.7564 0 12522 12522 13.11545425 335.93178891.312988e+00
posts 7 40300 5.293002e+001.043094e+02 0 2.254342e-01 0.0000 0 12309 12309 73.91520131 7005.33500365.196023e-01
playlists 8 40300 5.492804e-011.071956e+00 0 4.463400e-01 0.0000 0 98 98 28.21141801 1945.27874235.339791e-03
shouts 9 40300 2.997266e+011.506898e+02 4 8.841563e+00 4.4478 0 7736 7736 22.53491092 779.12269787.506393e-01
adopter10 40300 0.000000e+000.000000e+00 0 0.000000e+00 0.0000 0 0 0 NaN NaN0.000000e+00
tenure11 40300 4.380993e+011.978887e+01 44 4.372413e+01 22.2390 1 111 110 0.04578935 -0.69919439.857536e-02
good_country12 40300 3.577916e-014.793563e-01 0 3.222395e-01 0.0000 0 1 1 0.59331611 -1.64801692.387844e-03
$`1`
varsnmeansdmediantrimmedmadminmaxrangeskewkurtosisse
age 1 3527 2.597987e+016.843597e+00 24 2.505243e+01 4.4478 8 73 65 1.68330111 4.38827321.152343e-01
male 2 3527 7.292316e-014.444197e-01 1 7.863974e-01 0.0000 0 1 1 -1.03130738 -0.93667047.483255e-03
friend_cnt 3 3527 3.973377e+011.172749e+02 16 2.369182e+01 17.7912 1 5089 5088 26.04316491 1013.78544991.974705e+00
subscriber_friend_cnt 4 3527 1.636802e+005.849981e+00 0 8.402409e-01 0.0000 0 287 287 34.04550993 1609.51623509.850351e-02
songsListened 5 3527 3.375804e+044.359273e+0420908 2.581169e+0423276.8200 0 817290 817290 4.71192298 46.63596097.340258e+02
lovedTracks 6 3527 2.643408e+024.914268e+02 108 1.616844e+02 140.8470 0 10220 10220 6.51951052 80.96130468.274773e+00
posts 7 3527 2.120045e+012.219934e+02 0 1.436415e+00 0.0000 0 8506 8506 26.52258090 852.38134353.737984e+00
playlists 8 3527 9.007655e-012.563392e+00 1 5.876727e-01 1.4826 0 118 118 28.84310504 1244.30554784.316306e-02
shouts 9 3527 9.943975e+011.156073e+03 9 2.388913e+01 11.8608 0 65872 65872 52.51753042 2969.08645001.946626e+01
adopter10 3527 1.000000e+000.000000e+00 1 1.000000e+00 0.0000 1 1 0 NaN NaN0.000000e+00
tenure11 3527 4.558322e+012.004376e+01 46 4.559830e+01 20.7564 0 111 111 0.01966554 -0.62495173.375022e-01
good_country12 3527 2.874965e-014.526592e-01 0 2.345023e-01 0.0000 0 1 1 0.93864568 -1.11926147.621994e-03
In [6]:
#Let's look at the differences in the mean values of the variables in the adopter vs non-adapter subsamples.
lapply(data[,c('age','male' , 'friend_cnt' , 'avg_friend_male' ,'avg_friend_age', 
               'friend_country_cnt'  , 'songsListened' , 'lovedTracks' , 
               'posts' , 'playlists' ,'shouts' , 'tenure' ,'good_country', 'subscriber_friend_cnt')], function(i) t.test(i ~ data$adopter))
$age

	Welch Two Sample t-test

data:  i by data$adopter
t = -16.996, df = 4079.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.265768 -1.797097
sample estimates:
mean in group 0 mean in group 1 
       23.94844        25.97987 


$male

	Welch Two Sample t-test

data:  i by data$adopter
t = -13.654, df = 4295, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.12278707 -0.09195413
sample estimates:
mean in group 0 mean in group 1 
      0.6218610       0.7292316 


$friend_cnt

	Welch Two Sample t-test

data:  i by data$adopter
t = -10.646, df = 3675.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -25.15422 -17.32999
sample estimates:
mean in group 0 mean in group 1 
       18.49166        39.73377 


$avg_friend_male

	Welch Two Sample t-test

data:  i by data$adopter
t = -4.4426, df = 4591.6, p-value = 9.097e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02883955 -0.01117951
sample estimates:
mean in group 0 mean in group 1 
      0.6165888       0.6365983 


$avg_friend_age

	Welch Two Sample t-test

data:  i by data$adopter
t = -15.658, df = 4140.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.608931 -1.250852
sample estimates:
mean in group 0 mean in group 1 
       24.01142        25.44131 


$friend_country_cnt

	Welch Two Sample t-test

data:  i by data$adopter
t = -21.267, df = 3791.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.528795 -2.933081
sample estimates:
mean in group 0 mean in group 1 
       3.957891        7.188829 


$songsListened

	Welch Two Sample t-test

data:  i by data$adopter
t = -21.629, df = 3792.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17634.24 -14702.96
sample estimates:
mean in group 0 mean in group 1 
       17589.44        33758.04 


$lovedTracks

	Welch Two Sample t-test

data:  i by data$adopter
t = -21.188, df = 3705.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -193.9447 -161.0917
sample estimates:
mean in group 0 mean in group 1 
       86.82263       264.34080 


$posts

	Welch Two Sample t-test

data:  i by data$adopter
t = -4.2151, df = 3663.5, p-value = 2.557e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -23.30665  -8.50825
sample estimates:
mean in group 0 mean in group 1 
       5.293002       21.200454 


$playlists

	Welch Two Sample t-test

data:  i by data$adopter
t = -8.0816, df = 3634.7, p-value = 8.619e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4367565 -0.2662138
sample estimates:
mean in group 0 mean in group 1 
      0.5492804       0.9007655 


$shouts

	Welch Two Sample t-test

data:  i by data$adopter
t = -3.5659, df = 3536.5, p-value = 0.0003674
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -107.66170  -31.27249
sample estimates:
mean in group 0 mean in group 1 
       29.97266        99.43975 


$tenure

	Welch Two Sample t-test

data:  i by data$adopter
t = -5.0434, df = 4150.6, p-value = 4.768e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.462620 -1.083959
sample estimates:
mean in group 0 mean in group 1 
       43.80993        45.58322 


$good_country

	Welch Two Sample t-test

data:  i by data$adopter
t = 8.8009, df = 4248.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.05463587 0.08595434
sample estimates:
mean in group 0 mean in group 1 
      0.3577916       0.2874965 


$subscriber_friend_cnt

	Welch Two Sample t-test

data:  i by data$adopter
t = -12.287, df = 3632.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.413899 -1.024766
sample estimates:
mean in group 0 mean in group 1 
       0.417469        1.636802 

Visualizations

Demographics

In [62]:
# majority of non-adopters population are users between 18-30 years, and by count this population is much larger 
# than population of adopters.
# Adopters are generally a little older than nonadopters by 2-3 years.
options(repr.plot.width=6, repr.plot.height=4)
ggplot(data,aes(x=age,group=adopter,fill=adopter))+
  geom_histogram(position="identity",binwidth=0.5)+theme_minimal()
In [64]:
# Male population substencially prevails in both samples, by count we notice that adopter is a much smaller sample 
# than non-adopter. 
options(repr.plot.width=4, repr.plot.height=3)
ggplot(data,aes(x=male,group=adopter,fill=adopter))+
  geom_histogram(position="identity",binwidth=0.5)+theme_minimal()
In [65]:
# Both samples have more users from the rest of the world and much fewer users are from US, UK, Germany. 
ggplot(data,aes(x=good_country,group=adopter,fill=adopter))+
  geom_bar(position="dodge")+theme_minimal()

Peer influence

In [48]:
# In general, average amount of friends for adopters is around 40, whereas for non-adopters it is twice as less around 
# 18 friends per user.
options(repr.plot.width=3, repr.plot.height=2)
friend_cnt<-data %>%
  group_by(adopter)%>%
  summarise(friend_cnt=mean(friend_cnt))
ggplot(friend_cnt,aes(x = adopter,y=friend_cnt)) +
  geom_bar(stat="identity",position=position_identity(), fill="orange")+theme_minimal()
In [66]:
# Both samples demonstrate similar social characteristics in dependance with age: the younger the user, 
# the more friends he has and vs the older the user, the less friends he has. Users with highest amount 
# of friends are aged between 15 and 35 and users aged 55-65 have smallest friends list.
options(repr.plot.width=6, repr.plot.height=4)
m<-ggplot(data, aes(x = age, y = friend_cnt)) + 
  geom_point() +
  facet_wrap(~ adopter)+
  ylim(c(0, 2200))+
  geom_smooth(method = 'lm', color='red')
suppressWarnings(print(m))
In [12]:
# Average friends' age for majority adopters and non-adopters is around the same 15-45 years old.
ggplot(data,aes(x=avg_friend_age,group=adopter,fill=adopter))+
  geom_histogram(position="identity",binwidth=0.5)+theme_minimal()
In [49]:
# Adopters have around three times more friends who are premium subscribers than non-adopters.
subscriber_friend_cnt<-data %>%
  group_by(adopter)%>%
  summarise(subscriber_friend_cnt=mean(subscriber_friend_cnt))
ggplot(subscriber_friend_cnt,aes(x = adopter,y=subscriber_friend_cnt)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [50]:
# Male-female distribution among 2 samples is very similar with slight difference in favor of males in adopters.
avg_friend_male<- data %>%
  group_by(adopter)%>%
  summarise(avg_friend_male=mean(avg_friend_male))
ggplot(avg_friend_male,aes(x = adopter,y=avg_friend_male)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()

User engagement

In [51]:
# Adopters demonstrate a significantly higher engagement than non-adopters. On average, they listen to around 34K songs, 
# whereas non-adopters listen to twice as less around 18K songs.
songsListened<- data %>%
  group_by(adopter)%>%
  summarise(songsListened=mean(songsListened))
ggplot(songsListened,aes(x = adopter,y=songsListened)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [52]:
# LovedTracks follows the same tendency. Though the gap between the two samples is even higher: appx. 260 vs 86.
lovedTracks<- data %>%
  group_by(adopter)%>%
  summarise(lovedTracks=mean(lovedTracks))
ggplot(lovedTracks,aes(x = adopter,y=lovedTracks)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [53]:
# Adopters post much more than non-adopters. On average, premium users have 21 posts, whilst free users around 5.
posts<-data %>%
  group_by(adopter)%>%
  summarise(posts=mean(posts))
ggplot(posts,aes(x = adopter,y=posts)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [54]:
# Adopters on average have around 0.9 playlists, while non-adopters only around 0.55.
playlists<-data %>%
  group_by(adopter)%>%
  summarise(playlists=mean(playlists))
ggplot(playlists,aes(x = adopter,y=playlists)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [55]:
# Adopters received on average around 100 shouts from other users, while non-adopters received around 30 shouts.
shouts<-data %>%
  group_by(adopter)%>%
  summarise(shouts=mean(shouts))
ggplot(shouts,aes(x = adopter,y=shouts)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [56]:
# On average, adopters have been on the site for around 46 months, while non-adopters around 43.
tenure<- data %>%
  group_by(adopter)%>%
  summarise(tenure=mean(tenure))
ggplot(tenure,aes(x = adopter,y=tenure)) +
  geom_bar(stat="identity",position=position_identity(), fill="green")+theme_minimal()
In [68]:
# Boxplot proves the conclusion about mean values of tenure variable.
options(repr.plot.width=4, repr.plot.height=3)
boxplot(tenure~adopter,data=data, main="Adopter&Tenure Data", 
        xlab="Adopter", ylab="Tenure")
In [22]:
# Binning age into groups
data$age_group<-as.list(data$age)
newdf <- data %>%
  mutate(age_group = case_when(
    age_group > 0 & age_group<= 17 ~ "1",
    age_group > 17 & age_group <=34 ~ "2",
    age_group > 34 & age_group <=49   ~ "3",
    age_group > 49 & age_group <= 64  ~ "4",
    age_group > 64  ~ "5",
    TRUE                                 ~ "NA"
  ))
In [69]:
# The following dynamic graph demonstrates user activity (songsListened) by age group throughout their time on site.
# For adopter sample, the groups 2 and 3 listened more songs, while the groups 1,2,3 are more active in non-adopter sample. Group 2 shows the highest level of engagement for both samples.
gif<-ggplot(newdf, aes(age_group, songsListened, colour = good_country)) +
  geom_point(alpha = 1, show.legend = FALSE) +
  scale_size(range = c(12)) +
  facet_wrap(~adopter) +
  labs(title = 'Tenure: {frame_time}', x = 'age_group', y = 'songsListened') +
  transition_time(tenure) +
  ease_aes('linear')+
  ylim(c(0,200000))
suppressWarnings(print(gif))
                                                                              
In [24]:
# This scatterplot matrix demonstrates relationships between different variables for free users (nonadopters).
# We can see linear relationships between friends and premium user friends for both samples which indicates the more 
# friends a user has the more likely he will have premium users as his friends.
# The more time a user spent on site, the more songs a user listened.
adopter<-filter(data, adopter==1)
nonadopter<-filter(data, adopter==0)
pairs(~friend_cnt+tenure+subscriber_friend_cnt+songsListened, data=nonadopter,lower.panel = panel.smooth, upper.panel = panel.smooth,
      main="Scatterplot Matrix for NonAdopter")
In [25]:
# This matrix demonstrates relationships between different variables for premium users (adopters)
pairs(~friend_cnt+tenure+subscriber_friend_cnt+songsListened, data=adopter,lower.panel = panel.smooth, upper.panel = panel.smooth,
      main="Scatterplot Matrix for Adopter")

Propensity Score Matching (PSM)

In [26]:
# Grouping subscriber_friend_cnt into "treatment" group (1) and "control" group (0)
data$subscriber_friend_cnt <- ifelse(data$subscriber_friend_cnt >0,1,0)
In [27]:
# T-test for subscriber_friend_cnt by adopter
# Means of treatment and control groups are significantly different.
with(data, t.test(subscriber_friend_cnt ~ adopter))
	Welch Two Sample t-test

data:  subscriber_friend_cnt by adopter
t = -33.978, df = 3931.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3109641 -0.2770354
sample estimates:
mean in group 0 mean in group 1 
      0.2004715       0.4944712 
In [57]:
# Estimating means of covariates
data_cov <- c('age', 'male', 'good_country', 'friend_cnt', 'avg_friend_age', 'avg_friend_male', 'friend_country_cnt', 'songsListened', 'lovedTracks', 'posts', 'playlists', 'shouts', 'tenure' )
data %>%
  group_by(adopter) %>%
  select(one_of(data_cov)) %>%
  summarise_all(funs(mean(., na.rm = T)))
Adding missing grouping variables: `adopter`
adopteragemalegood_countryfriend_cntavg_friend_ageavg_friend_malefriend_country_cntsongsListenedlovedTrackspostsplaylistsshoutstenure
0 23.94844 0.62186100.357791618.49166 24.01142 0.61658883.957891 17589.44 86.82263 5.2930020.549280429.97266 43.80993
1 25.97987 0.72923160.287496539.73377 25.44131 0.63659837.188829 33758.04 264.3408021.2004540.900765599.43975 45.58322
In [29]:
# Propensity score estimation
m_pscore <- suppressWarnings(glm(subscriber_friend_cnt ~ age + male + good_country + 
                  friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + 
                  songsListened + lovedTracks + posts + playlists + shouts + tenure,
                family = binomial(), data = data))
In [30]:
summary(m_pscore)
Call:
glm(formula = subscriber_friend_cnt ~ age + male + good_country + 
    friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + 
    songsListened + lovedTracks + posts + playlists + shouts + 
    tenure, family = binomial(), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4206  -0.5671  -0.4220  -0.3001   2.5619  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.144e+00  7.703e-02 -66.782  < 2e-16 ***
age                 1.970e-02  2.808e-03   7.015 2.30e-12 ***
male                4.311e-02  2.998e-02   1.438 0.150419    
good_country        3.201e-02  2.922e-02   1.096 0.273235    
friend_cnt          3.132e-02  1.034e-03  30.301  < 2e-16 ***
avg_friend_age      7.955e-02  3.481e-03  22.850  < 2e-16 ***
avg_friend_male     2.514e-01  5.029e-02   4.999 5.75e-07 ***
friend_country_cnt  1.110e-01  4.765e-03  23.302  < 2e-16 ***
songsListened       6.906e-06  5.156e-07  13.396  < 2e-16 ***
lovedTracks         6.671e-04  5.645e-05  11.817  < 2e-16 ***
posts               5.699e-04  2.682e-04   2.125 0.033613 *  
playlists           5.639e-03  1.190e-02   0.474 0.635530    
shouts             -4.909e-05  3.707e-05  -1.324 0.185434    
tenure             -2.571e-03  7.769e-04  -3.309 0.000935 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 46640  on 43826  degrees of freedom
Residual deviance: 34170  on 43813  degrees of freedom
AIC: 34198

Number of Fisher Scoring iterations: 7
In [31]:
prscore_df <- data.frame(pr_score = predict(m_pscore, type = "response"),
                         subscriber_friend_cnt = m_pscore$model$subscriber_friend_cnt)
head(prscore_df)
pr_scoresubscriber_friend_cnt
0.085973340
0.144177670
0.082170100
0.238940671
0.695522080
0.223066330
In [32]:
head(m_pscore$model)
subscriber_friend_cntagemalegood_countryfriend_cntavg_friend_ageavg_friend_malefriend_country_cntsongsListenedlovedTrackspostsplaylistsshoutstenure
0 22 0 1 8 22.57143 0.42857141 9687 194 0 1 8 59
0 35 0 0 2 28.00000 1.00000002 0 0 0 0 0 35
0 27 1 0 2 23.00000 1.00000001 508 0 0 1 2 42
1 21 0 0 28 22.94737 0.50000007 1357 32 0 0 1 25
0 24 0 0 65 22.28302 0.91379319 89984 20 2 0 81 67
0 21 1 1 12 25.00000 0.77777781 124547 10 0 1 2 53
In [33]:
labs <- paste("Type of User:", c("Premium", "Non-Premium"))
pscore_df<-prscore_df %>%
  mutate(adopter = ifelse(subscriber_friend_cnt == 1, labs[1], labs[2]))
ggplot(prscore_df,aes(x = pr_score)) +
  geom_histogram(color="black", fill="green") +
  facet_wrap(~subscriber_friend_cnt) +
  xlab("Probability of Treatment (Having >1 Subscriber)") +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In [34]:
#Looking for pairs of observations with similar propensity scores
data_nomiss <- data %>%  
  select(subscriber_friend_cnt, adopter, one_of(data_cov)) %>%
  na.omit()

match <- suppressWarnings(matchit(subscriber_friend_cnt ~ age + male + good_country + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + songsListened + lovedTracks + posts + playlists + shouts + tenure,
                 method = "nearest", data = data_nomiss))
In [35]:
# Information if matching was successful
summary(match)
Call:
matchit(formula = subscriber_friend_cnt ~ age + male + good_country + 
    friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + 
    songsListened + lovedTracks + posts + playlists + shouts + 
    tenure, data = data_nomiss, method = "nearest")

Summary of balance for all data:
                   Means Treated Means Control SD Control  Mean Diff    eQQ Med
distance                  0.4635        0.1550     0.1436     0.3086     0.2506
age                      25.3732       23.7476     6.2245     1.6256     1.0000
male                      0.6363        0.6288     0.4831     0.0074     0.0000
good_country              0.3433        0.3547     0.4784    -0.0114     0.0000
friend_cnt               54.0210       10.4313    15.2769    43.5896    22.0000
avg_friend_age           25.3904       23.7614     5.0577     1.6291     1.5909
avg_friend_male           0.6358        0.6131     0.3343     0.0227     0.0738
friend_country_cnt        9.3856        2.7251     3.1024     6.6606     5.0000
songsListened         33735.6404    14602.2205 23214.2898 19133.4199 15471.0000
lovedTracks             225.3647       65.2137   181.4812   160.1510    65.0000
posts                    20.5230        2.5434    33.7947    17.9796     0.0000
playlists                 0.7441        0.5295     0.9673     0.2146     0.0000
shouts                  101.8195       16.4230    79.7381    85.3965    15.0000
tenure                   46.5487       43.2027    19.7212     3.3460     3.0000
                     eQQ Mean     eQQ Max
distance               0.3086      0.6840
age                    1.6296      5.0000
male                   0.0074      1.0000
good_country           0.0114      1.0000
friend_cnt            43.5838   4794.0000
avg_friend_age         1.6369     11.5000
avg_friend_male        0.0958      0.3636
friend_country_cnt     6.6598     95.0000
songsListened      19126.1623 653702.0000
lovedTracks          159.9562   6343.0000
posts                 17.8829   9535.0000
playlists              0.2092     26.0000
shouts                85.1764  59168.0000
tenure                 3.3473     10.0000


Summary of balance for matched data:
                   Means Treated Means Control SD Control Mean Diff   eQQ Med
distance                  0.4635        0.3040     0.1913    0.1596    0.1077
age                      25.3732       26.3324     7.9056   -0.9592    1.0000
male                      0.6363        0.6576     0.4745   -0.0214    0.0000
good_country              0.3433        0.3581     0.4795   -0.0149    0.0000
friend_cnt               54.0210       21.4666    23.5251   32.5544   12.0000
avg_friend_age           25.3904       26.5572     6.7320   -1.1668    0.4376
avg_friend_male           0.6358        0.6551     0.2643   -0.0193    0.0158
friend_country_cnt        9.3856        5.0914     4.6473    4.2942    2.0000
songsListened         33735.6404    27360.8630 33892.7804 6374.7775 4680.0000
lovedTracks             225.3647      134.5440   299.1995   90.8206   38.0000
posts                    20.5230        6.2773    60.2598   14.2456    0.0000
playlists                 0.7441        0.6723     1.4015    0.0718    0.0000
shouts                  101.8195       37.2362   138.8781   64.5833   10.0000
tenure                   46.5487       47.7039    19.0357   -1.1551    1.0000
                    eQQ Mean     eQQ Max
distance              0.1596      0.4517
age                   0.9592      7.0000
male                  0.0214      1.0000
good_country          0.0149      1.0000
friend_cnt           32.5544   4794.0000
avg_friend_age        1.2763     14.0000
avg_friend_male       0.0326      0.1602
friend_country_cnt    4.2942     95.0000
songsListened      6374.7775 566867.0000
lovedTracks          90.8206   6180.0000
posts                14.2456   9535.0000
playlists             0.1035     22.0000
shouts               64.5833  59168.0000
tenure                1.2995      4.0000

Percent Balance Improvement:
                   Mean Diff. eQQ Med  eQQ Mean  eQQ Max
distance              48.2930 57.0083   48.2908  33.9658
age                   40.9972  0.0000   41.1419 -40.0000
male                -187.9614  0.0000 -187.6712   0.0000
good_country         -30.1771  0.0000  -30.3571   0.0000
friend_cnt            25.3162 45.4545   25.3062   0.0000
avg_friend_age        28.3760 72.4916   22.0309 -21.7391
avg_friend_male       14.7957 78.6165   65.9532  55.9466
friend_country_cnt    35.5279 60.0000   35.5203   0.0000
songsListened         66.6825 69.7499   66.6699  13.2836
lovedTracks           43.2906 41.5385   43.2216   2.5698
posts                 20.7676  0.0000   20.3394   0.0000
playlists             66.5567  0.0000   50.5109  15.3846
shouts                24.3724 33.3333   24.1770   0.0000
tenure                65.4771 66.6667   61.1782  60.0000

Sample sizes:
          Control Treated
All         34004    9823
Matched      9823    9823
Unmatched   24181       0
Discarded       0       0
In [67]:
options(repr.plot.width=4, repr.plot.height=3)
plot(match)
In [37]:
# dataframe with only matched observations
df_matched <- match.data(match)
head(df_matched)
subscriber_friend_cntadopteragemalegood_countryfriend_cntavg_friend_ageavg_friend_malefriend_country_cntsongsListenedlovedTrackspostsplaylistsshoutstenuredistanceweights
10 0 22 0 1 8 22.57143 0.4285714 1 9687 194 0 1 8 59 0.085973341
41 0 21 0 0 28 22.94737 0.5000000 7 1357 32 0 0 1 25 0.238940671
50 0 24 0 0 65 22.28302 0.9137931 9 89984 20 2 0 81 67 0.695522081
60 0 21 1 1 12 25.00000 0.7777778 1 124547 10 0 1 2 53 0.223066331
70 0 20 0 1 15 22.00000 0.6363636 1 24852 391 6 1 67 56 0.126440801
81 0 23 1 0 57 23.63636 0.5208333 14 99877 125 89 1 44 71 0.793814531
In [38]:
# examining difference in covariate means of the matched sample
df_matched%>%
  group_by(subscriber_friend_cnt) %>%
  select(one_of(data_cov)) %>%
  summarise_all(funs(mean))
Adding missing grouping variables: `subscriber_friend_cnt`
subscriber_friend_cntagemalegood_countryfriend_cntavg_friend_ageavg_friend_malefriend_country_cntsongsListenedlovedTrackspostsplaylistsshoutstenure
0 26.33238 0.65764020.358139121.46656 26.55723 0.65514515.091418 27360.86 134.5440 6.2773080.6722997 37.2361847.70386
1 25.37321 0.63626180.343276054.02097 25.39043 0.63580779.385626 33735.64 225.3647 20.5229560.7440700101.8195146.54871
In [39]:
lapply(data_cov, function(v) {
  t.test(df_matched[, v] ~ df_matched$subscriber_friend_cnt)
})
[[1]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 9.0201, df = 19340, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.750746 1.167609
sample estimates:
mean in group 0 mean in group 1 
       26.33238        25.37321 


[[2]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 3.1356, df = 19640, p-value = 0.001718
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.008014461 0.034742334
sample estimates:
mean in group 0 mean in group 1 
      0.6576402       0.6362618 


[[3]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 2.183, df = 19642, p-value = 0.02905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.001517727 0.028208426
sample estimates:
mean in group 0 mean in group 1 
      0.3581391       0.3432760 


[[4]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -24.809, df = 10486, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.12657 -29.98225
sample estimates:
mean in group 0 mean in group 1 
       21.46656        54.02097 


[[5]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 13.628, df = 18412, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9989779 1.3346200
sample estimates:
mean in group 0 mean in group 1 
       26.55723        25.39043 


[[6]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 5.4719, df = 19270, p-value = 4.508e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01241054 0.02626427
sample estimates:
mean in group 0 mean in group 1 
      0.6551451       0.6358077 


[[7]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -38.564, df = 13868, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.512475 -4.075940
sample estimates:
mean in group 0 mean in group 1 
       5.091418        9.385626 


[[8]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -11.383, df = 18452, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7472.436 -5277.119
sample estimates:
mean in group 0 mean in group 1 
       27360.86        33735.64 


[[9]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -15.488, df = 16091, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -102.31423  -79.32702
sample estimates:
mean in group 0 mean in group 1 
       134.5440        225.3647 


[[10]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -5.6775, df = 11043, p-value = 1.401e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -19.164006  -9.327289
sample estimates:
mean in group 0 mean in group 1 
       6.277308       20.522956 


[[11]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -2.9528, df = 17787, p-value = 0.003154
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1194130 -0.0241277
sample estimates:
mean in group 0 mean in group 1 
      0.6722997       0.7440700 


[[12]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = -8.5069, df = 10514, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -79.46491 -49.70174
sample estimates:
mean in group 0 mean in group 1 
       37.23618       101.81951 


[[13]]

	Welch Two Sample t-test

data:  df_matched[, v] by df_matched$subscriber_friend_cnt
t = 4.1551, df = 19604, p-value = 3.266e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6102242 1.7000680
sample estimates:
mean in group 0 mean in group 1 
       47.70386        46.54871 

In [40]:
# Estimating treatment effects using Visual Inspection
fn_bal <- function(df_matched, variable) {
  df_matched$variable <- df_matched[, variable]
  df_matched$subscriber_friend_cnt <- as.factor(df_matched$subscriber_friend_cnt)
  support <- c(min(df_matched$variable), max(df_matched$variable))
  ggplot(df_matched, aes(x = distance, y = variable, color = subscriber_friend_cnt)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}
In [58]:
options(repr.plot.width=6, repr.plot.height=8)
suppressWarnings(grid.arrange(
  fn_bal(df_matched, "age"),
  fn_bal(df_matched, "male") + theme(legend.position = "none"),
  fn_bal(df_matched, "good_country"),
  fn_bal(df_matched, "friend_cnt") + theme(legend.position = "none"),
  fn_bal(df_matched, "avg_friend_age"),
  fn_bal(df_matched, "avg_friend_male") + theme(legend.position = "none"),
  fn_bal(df_matched, "friend_country_cnt"),
  fn_bal(df_matched, "songsListened") + theme(legend.position = "none"),
  fn_bal(df_matched, "lovedTracks"),
  fn_bal(df_matched, "posts") + theme(legend.position = "none"),
  fn_bal(df_matched, "playlists"),
  fn_bal(df_matched, "shouts") + theme(legend.position = "none"),
  fn_bal(df_matched, "tenure"),
  nrow = 7, widths = c(1, 0.8)
))
In [42]:
# Estimating treatment effects using t-test
# T-test in adopter group 0.18 is higher than the one in non-adopter 0.09, which proves the hypothesis that having 
# subscriber friends affects the likelihood of becoming an adopter 
with(df_matched, t.test(adopter ~ subscriber_friend_cnt))
	Welch Two Sample t-test

data:  adopter by subscriber_friend_cnt
t = -18.938, df = 18060, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.10009352 -0.08131745
sample estimates:
mean in group 0 mean in group 1 
     0.08683702      0.17754250 
In [43]:
# OLS without covariates
treat_wt <- lm(adopter ~ subscriber_friend_cnt, data = df_matched)
summary(treat_wt)
Call:
lm(formula = adopter ~ subscriber_friend_cnt, data = df_matched)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.17754 -0.17754 -0.08684 -0.08684  0.91316 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.086837   0.003387   25.64   <2e-16 ***
subscriber_friend_cnt 0.090705   0.004790   18.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3357 on 19644 degrees of freedom
Multiple R-squared:  0.01793,	Adjusted R-squared:  0.01788 
F-statistic: 358.7 on 1 and 19644 DF,  p-value: < 2.2e-16
In [44]:
# with covariates
treat_with<- lm(adopter ~ subscriber_friend_cnt +  age + male + good_country + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + songsListened + lovedTracks + posts + playlists + shouts + tenure, data = df_matched)
summary(treat_with)
Call:
lm(formula = adopter ~ subscriber_friend_cnt + age + male + good_country + 
    friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + 
    songsListened + lovedTracks + posts + playlists + shouts + 
    tenure, data = df_matched)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.27876 -0.15553 -0.10616 -0.05705  1.00012 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -3.618e-02  1.338e-02  -2.705 0.006844 ** 
subscriber_friend_cnt  7.663e-02  4.901e-03  15.635  < 2e-16 ***
age                    1.764e-03  4.596e-04   3.838 0.000125 ***
male                   3.070e-02  5.125e-03   5.991 2.12e-09 ***
good_country          -3.889e-02  5.014e-03  -7.756 9.22e-15 ***
friend_cnt            -1.620e-05  3.595e-05  -0.451 0.652317    
avg_friend_age         1.556e-03  5.799e-04   2.684 0.007282 ** 
avg_friend_male        7.275e-03  9.820e-03   0.741 0.458791    
friend_country_cnt     1.055e-03  4.454e-04   2.368 0.017884 *  
songsListened          6.209e-07  6.610e-08   9.394  < 2e-16 ***
lovedTracks            8.509e-05  5.971e-06  14.250  < 2e-16 ***
posts                  2.819e-05  1.355e-05   2.081 0.037461 *  
playlists              7.252e-03  1.407e-03   5.155 2.56e-07 ***
shouts                 1.295e-05  4.562e-06   2.838 0.004538 ** 
tenure                -3.098e-04  1.322e-04  -2.344 0.019095 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3304 on 19631 degrees of freedom
Multiple R-squared:  0.04936,	Adjusted R-squared:  0.04868 
F-statistic: 72.81 on 14 and 19631 DF,  p-value: < 2.2e-16

The difference in subscriber_friend_cnt coefficients between regressions with or without covariates is very small, which indicates that PSM reduced the bias and we were able to estimate the effect of a treatment.

Regression analysis

In [45]:
#Logistic Regression

result <- glm(adopter ~ male + age + subscriber_friend_cnt + friend_cnt + avg_friend_age + friend_country_cnt + songsListened + lovedTracks + good_country + playlists + tenure + shouts + posts + avg_friend_male,
              family = binomial(), data = data)
summary(result)
Call:
glm(formula = adopter ~ male + age + subscriber_friend_cnt + 
    friend_cnt + avg_friend_age + friend_country_cnt + songsListened + 
    lovedTracks + good_country + playlists + tenure + shouts + 
    posts + avg_friend_male, family = binomial(), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6288  -0.3990  -0.3240  -0.2678   2.7604  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -4.213e+00  9.562e-02 -44.062  < 2e-16 ***
male                   4.139e-01  4.175e-02   9.914  < 2e-16 ***
age                    2.103e-02  3.517e-03   5.979 2.24e-09 ***
subscriber_friend_cnt  9.719e-01  4.211e-02  23.080  < 2e-16 ***
friend_cnt            -4.584e-04  2.972e-04  -1.543 0.122942    
avg_friend_age         2.369e-02  4.637e-03   5.108 3.25e-07 ***
friend_country_cnt     1.401e-02  3.646e-03   3.843 0.000122 ***
songsListened          6.152e-06  5.212e-07  11.805  < 2e-16 ***
lovedTracks            6.148e-04  4.828e-05  12.734  < 2e-16 ***
good_country          -3.939e-01  4.077e-02  -9.661  < 2e-16 ***
playlists              6.467e-02  1.310e-02   4.938 7.89e-07 ***
tenure                -4.929e-03  1.024e-03  -4.812 1.49e-06 ***
shouts                 7.416e-05  6.476e-05   1.145 0.252113    
posts                  1.074e-04  9.027e-05   1.189 0.234260    
avg_friend_male        1.047e-01  6.555e-02   1.597 0.110222    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24537  on 43826  degrees of freedom
Residual deviance: 22198  on 43812  degrees of freedom
AIC: 22228

Number of Fisher Scoring iterations: 5
In [60]:
exp(result$coefficients)-1
(Intercept)
-0.985198383371094
male
0.512735699755603
age
0.0212535613684131
subscriber_friend_cnt
1.64283836579023
friend_cnt
-0.000458334648930192
avg_friend_age
0.0239687262558521
friend_country_cnt
0.014110551524471
songsListened
6.15251230984271e-06
lovedTracks
0.000615002727587299
good_country
-0.325580446450224
playlists
0.0668115161049347
tenure
-0.0049164469117472
shouts
7.41647200452888e-05
posts
0.000107372614990275
avg_friend_male
0.110371072878788

Conclusion

All the variables in this regression output have low p-values, which indicate they are all significant in the model. Most of the variables (besides tenure, friend_cnt, and good_country) have linear relationships with the adopter variable. It indicates that a one-unit increase in any of these variables, we expect an increase in the log-odds of the dependent variable adopter. Whereas tenure and goo_country have inverse relationships with adopter variable, a one-unit increase in these variables, we expect a decrease in the log-odds of the dependent variable. From this analysis we can conclude that the higher peer influence (many subscriber friends, high diversity of friends), the more likely a free user can be converted to a premium subscriber. Another condition like user engagement positively affects a "free-to-fee" strategy. The higher engagement (songs listened, playlists, tracks, etc.) of a user, the more chances to convert him to a premium user. Recommendations to Highnote for new subscribers:

1) target groups with high social engagement

2) motivate premuim users to attract their friends to their paid services. peer subscribers can influence their friends to subscribe for a premium account.

3) Highnote has more chances to obtain a new subscriber if they target a male in his late 20s or 30s who is not from US, UK or Germany.