class: title-slide

# Econ 520: Data Science for Economists
## Lecture 11: A/B Testing
<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>
<br>

<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Lecture Structure

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Main Goals

- The main goal of this lecture is to learn how to analyze data from A/B tests (RCTs).

- We will first discuss a case study.

- Then, we will flip the classroom and work on exercises in class.

- This exercise will become a problem set that is due on March 25, 2024.

---
class: center, middle
name: Causal Inference

# Potential Outcomes and Fundamental Problem of Causal Inference

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Potential Outcomes

- As we have discussed in the previous three lectures, we are now interested in analyzing the effect of a treatment (D) on outcomes (Y).

- To formulate the problem, we have adopted the potential outcomes framework:

  - `\(Y_i(0)\)` = the potential outcome for unit i if the treatment is not applied

  - `\(Y_i(1)\)` = the potential outcome for unit i if the treatment is applied

- Individual treatment effects are defined as:

  - `\(Y_i(1) - Y_i(0)\)` = the individual treatment effect for individual i

- .hi[Fundamental Problem of Causal Inference:] <br> For each `\(i\)`, we can only observe `\(Y_i(1)\)` or `\(Y_i(0)\)`, but not both.

---
# Parameters of Interest

- The fact that we cannot observe both potential outcomes for the same individual does not preclude us from formulating causal questions that involve both potential outcomes.

- However, we will recognize that we cannot estimate individual treatment effects, at least not in a realistic manner.

- Thus, we will focus on learning about average treatment effects (ATEs) and average treatment effects on the treated (ATTs).
- `\(ATE = E[Y(1)-Y(0)]\)`

- `\(ATT = E[Y(1)-Y(0)|D=1]\)`

- Now, if we cannot see `\(Y(0)\)` and `\(Y(1)\)`, how can we identify these parameters?

---
class: center, middle
name: Causal Inference

# A/B Testing

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# A/B tests

- In general, a simple comparison of means does not recover the ATE or ATT, as we have seen in the previous lectures:

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] &=& ATT + (E[Y(0)|D=1] - E[Y(0)|D=0])\\ \\ &=& ATT + Selection~Bias \end{eqnarray*}$$`

- However, in the context of A/B tests, we can recover the ATE and ATT.

- That is because treatment `\(D\)` is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

---
# A/B tests

- In general, a simple comparison of means does not recover the ATE or ATT, as we have seen in the previous lectures:

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] &=& ATT + (E[Y(0)|D=1] - E[Y(0)|D=0])\\ \\ &=& ATT + Selection~Bias \end{eqnarray*}$$`

- However, in the context of A/B tests, we can recover the ATE and ATT.

- That is because treatment `\(D\)` is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

- In this case, we have

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] = ATT = ATE \end{eqnarray*}$$`

---
# A/B tests, in practice

- In order to estimate the `\(ATE\)`, we can leverage a simple regression model,

$$ Y_i = \alpha + \beta D_i + \epsilon_i,$$

and it is straightforward to show that `\(\beta = ATE\)`.

- Thus, OLS estimates of `\(\beta\)` will be unbiased and consistent for the ATE.

- We can also conduct valid inference for `\(\beta\)` using standard inference procedures from regression analysis .hi[(that you have learned in Econ 320).]

---
class: center, middle
name: Causal Inference

# A/B Testing, in practice

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Case Study from Kaggle

- The dataset we will use is from [Kaggle](https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing?)
- Marketing companies want to run successful campaigns, but the market is complex and several options can work.

- The companies are interested in answering two questions:
<div style="margin-top: -.5cm;"></div>
  - Would the campaign be successful?

  - If the campaign were successful, how much of that success could be attributed to the ads?

- So they normally turn to A/B tests:
<div style="margin-top: -.5cm;"></div>
  - two or more versions of a variable (web page, page element, banner, etc.) are shown to different people at the same time to see which performs better.

---
# Case Study from Kaggle: The A/B test design

- To check which type of ad is more successful, the company can run an A/B test.

- There are many possible designs, one of them being:

  - The majority of the people will be exposed to ads (the experimental group).

  - A small portion of people (the control group) would instead see a Public Service Announcement (PSA) (or nothing) in the exact size and place where the ad would normally appear.

- Let's look at an example of this.

---
# Data Dictionary

- The data file is located at (https://psantanna.com/Econ520/files/marketing_AB.csv)

- This data is from [Kaggle](https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing?)
- The data dictionary is as follows:

  - Index: Row index
  - user.id: User ID (unique)
  - test.group: If "ad", the person saw the advertisement; if "psa", they only saw the public service announcement
  - converted: True if the person bought the product, False otherwise
  - total.ads: Number of ads seen by the person
  - most.ads.day: Day of the week on which the person saw the most ads
  - most.ads.hour: Hour of the day at which the person saw the most ads

---
# Start the analysis

```r
# Load the necessary libraries
library(tidyverse)
library(estimatr) # library for regression with robust standard errors

# Load the data
ab_mkt <- read.csv(url("https://psantanna.com/Econ520/files/marketing_AB.csv"))

# Transform the data into a tibble
ab_mkt <- as_tibble(ab_mkt)

# Show the first rows of the data
head(ab_mkt)
```

```
## # A tibble: 6 × 7
##       X user.id test.group converted total.ads most.ads.day most.ads.hour
##   <int>   <int> <chr>      <chr>         <int> <chr>                <int>
## 1     0 1069124 ad         False           130 Monday                  20
## 2     1 1119715 ad         False            93 Tuesday                 22
## 3     2 1144181 ad         False            21 Tuesday                 18
## 4     3 1435133 ad         False           355 Tuesday                 10
## 5     4 1015700 ad         False           276 Friday                  14
## 6     5 1137664 ad         False           734 Saturday                10
```

```r
# Sample size
nrow(ab_mkt)
```

```
## [1] 588101
```

---
# Some sanity checks

```r
# Let's check if all users are unique
unique_users <- unique(ab_mkt$user.id)
length(unique_users) == nrow(ab_mkt)
```

```
## [1] TRUE
```

```r
# Check the proportion of missing data
ab_mkt %>%
  summarise_all(~sum(is.na(.))/n())
```

```
## # A tibble: 1 × 7
##       X user.id test.group converted total.ads most.ads.day most.ads.hour
##   <dbl>   <dbl>      <dbl>     <dbl>     <dbl>        <dbl>         <dbl>
## 1     0       0          0         0         0            0             0
```

---
# Some data cleaning

```r
# Create numerical variables for most.ads.day and converted
ab_mkt <- ab_mkt %>%
  mutate(converted = 1*(converted == "True"),
         most.ads.day.num = as.numeric(
           factor(most.ads.day,
                  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                             "Friday", "Saturday", "Sunday"))
         )
  )
```

---
# Some summary statistics

```r
# Check how many units are treated and control,
# together with some summary statistics
ab_mkt %>%
  group_by(test.group) %>%
  summarize(n = n(), # Number of treated and untreated units
            prop = n()/nrow(ab_mkt), # Proportion of treated and untreated units
            conversion_rate = mean(converted), # Mean conversion rate
            avg_total_ads = mean(total.ads), # Mean total ads
            med_total_ads = median(total.ads), # Median total ads
            avg_most_ads_hour = mean(most.ads.hour), # Mean most ads hour
            average_most_ads_day = mean(most.ads.day.num) # Mean most ads day
  )
```

```
## # A tibble: 2 × 8
##   test.group      n   prop conversion_rate avg_total_ads med_total_ads avg_most_ads_hour average_most_ads_day
##   <chr>       <int>  <dbl>           <dbl>         <dbl>         <dbl>             <dbl>                <dbl>
## 1 ad         564577 0.960           0.0255          24.8            13              14.5                 4.03
## 2 psa         23524 0.0400          0.0179          24.8            12              14.3                 3.95
```

---
# Check if the effect of the campaign is statistically significant

```r
# Estimate the ATE
lm_ate <- lm_robust(converted ~ test.group, data = ab_mkt)
summary(lm_ate)
```

```
## 
## Call:
## lm_robust(formula = converted ~ test.group, data = ab_mkt)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper     DF
## (Intercept)    0.025547  0.0002100 121.660 0.000e+00  0.025135  0.025958 588099
## test.grouppsa -0.007692  0.0008886  -8.657 4.848e-18 -0.009434 -0.005951 588099
## 
## Multiple R-squared:  9.236e-05 , Adjusted R-squared:  9.066e-05 
## F-statistic: 74.95 on 1 and 588099 DF,  p-value: < 2.2e-16
```

```r
# Notice that the treatment here is the omission of the campaign,
# so things are ``reversed''
```

---
# Change the treatment variable

```r
# Treatment is the ad
ab_mkt <- ab_mkt %>%
  mutate(treated = ifelse(test.group == "ad", 1, 0))

# Estimate the ATE
lm_ate <- lm_robust(converted ~ treated, data = ab_mkt)
summary(lm_ate)
```

```
## 
## Call:
## lm_robust(formula = converted ~ treated, data = ab_mkt)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper     DF
## (Intercept) 0.017854  0.0008634  20.679 5.801e-95 0.016162 0.019546 588099
## treated     0.007692  0.0008886   8.657 4.848e-18 0.005951 0.009434 588099
## 
## Multiple R-squared:  9.236e-05 , Adjusted R-squared:  9.066e-05 
## F-statistic: 74.95 on 1 and 588099 DF,  p-value: < 2.2e-16
```

---
# Let's check if the treatment also has an impact on total ads

```r
# Check if the treatment and control groups are balanced with respect to total ads
lm_ads <- lm_robust(total.ads ~ treated, data = ab_mkt)
lm_ads
```

```
##                Estimate Std. Error    t value  Pr(>|t|)   CI Lower   CI Upper     DF
## (Intercept) 24.76113756  0.2794498 88.6067315 0.0000000 24.2134248 25.3088503 588099
## treated      0.06222754  0.2854515  0.2179969 0.8274316 -0.4972482  0.6217033 588099
```

What are the possible mechanisms that could justify these results?

---
# What about the time of the day and day of the week?

```r
lm_most_ads_hour <- lm_robust(most.ads.hour ~ treated, data = ab_mkt)
lm_most_ads_hour
```

```
##              Estimate Std. Error    t value    Pr(>|t|)   CI Lower   CI Upper     DF
## (Intercept) 14.304923 0.03035845 471.200628 0.00000e+00 14.2454210 14.3644242 588099
## treated      0.170977 0.03103480   5.509203 3.60613e-08  0.1101498  0.2318042 588099
```

```r
lm_most_ads_day <- lm_robust(most.ads.day.num ~ treated, data = ab_mkt)
lm_most_ads_day
```

```
##               Estimate Std. Error    t value     Pr(>|t|)   CI Lower  CI Upper     DF
## (Intercept) 3.95264411 0.01270701 311.060029 0.000000e+00 3.92773877 3.9775494 588099
## treated     0.07592595 0.01298450   5.847428 4.994931e-09 0.05047674 0.1013752 588099
```

But do these regressions make (economic) sense? .hi[I think not!]

---
class: center, middle
name: Causal Inference

# Exercise and Problem Set

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Problem Description

- Now, let's move to a more involved exercise, one in which you will need to check a few more things.
- In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others.

- These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI).

- In these experiments, UI claimants were randomly assigned either to a control group or to one of five treatment groups.

---
# Problem Description

- Actually, there are six treatment groups in the experiments, but following Bilias (2000) we merge groups 4 and 6.

- In the control group, the current UI rules applied.

- Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (the qualification period), provided that the job was retained for a specified duration.

- The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period; see (http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt) for further details on the data.

---
# Part I: some data management

- The data for the analysis are described [here](http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/), and you can download them from (https://psantanna.com/Econ520/files/penn_jae.dat)

- You should load the data into R and check the structure of the data.

- You should merge treatment groups 4 and 6 into a single treatment group.

- Once that is done, keep only observations from the control group and the merged treatment group.

- How many observations are treated and how many are in the control group? What are the relative sizes of these groups?

- Are there missing data in the dataset? If so, how many?
For which variables?

---
# Part II: Analyzing the data

- We are particularly interested in assessing the effect of the treatment on the log of the unemployment duration, so you should create a new variable `log_duration` that is the natural logarithm of the variable `inuidur1`.

- Provide some summary statistics by treatment group for the following variables: `inuidur1`, `log_duration`, `female`, `black`, `othrace`, `agelt35`, and `agegt54`. Do the treatment and control groups differ in terms of these variables?

- Estimate the `\(ATE\)` of the treatment on the log of the unemployment duration using a simple linear regression model, and interpret the results.

  - What is the point estimate?

  - Is it statistically significant?

  - What is the 95% confidence interval?

---
# Part III: Heterogeneity Analysis

- Do the `\(ATE\)` estimates vary by race (white, black, or other race)?

- Do the `\(ATE\)` estimates vary by age group (less than 35 years, between 35 and 54 years, and more than 54 years)?

- Do the `\(ATE\)` estimates vary by gender and race?

.hi[To answer all these questions, you need not only to compute the group-specific `\(ATE\)`s, but also to assess how they vary across groups and whether such variation is statistically significant.]

---
# Part IV: Conclusion

- What have you learned in this exercise?

- What would be the next steps in this analysis?

- What are the limitations of the analysis?

- Summarize the main findings in two pages, with the most useful insights. Include the most important plots and tables.

  - Think of this as an executive summary.

- Write this in a Markdown file and export it as a .pdf or .html file in the `docs` folder in your project.
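---
# Appendix: a starter sketch for Part I

- Below is a minimal, hedged sketch of the data-management steps in Part I. The treatment-group column name (`tg`, with 0 = control) is taken from the Bilias readme and is an assumption here; verify it against the data description before relying on it. A toy data frame stands in for the real file so that the snippet is self-contained.

```r
# Load dplyr for the data-management verbs used below
library(dplyr)

# Toy stand-in for the real data; for the problem set, load the real file, e.g.:
# penn <- read.table("https://psantanna.com/Econ520/files/penn_jae.dat", header = TRUE)
penn <- data.frame(tg       = c(0, 1, 4, 6, 0, 4),   # assumed treatment-group column
                   inuidur1 = c(10, 5, 8, 3, 12, 6)) # unemployment duration

penn_clean <- penn %>%
  mutate(tg = ifelse(tg == 6, 4, tg)) %>%  # merge treatment groups 4 and 6
  filter(tg %in% c(0, 4)) %>%              # keep control + merged treatment group
  mutate(treated = 1 * (tg == 4),          # treatment indicator
         log_duration = log(inuidur1))     # outcome used in Part II

# Group sizes and relative proportions
penn_clean %>%
  count(treated) %>%
  mutate(prop = n / sum(n))
```

- With the real data, the same pipeline feeds directly into `lm_robust(log_duration ~ treated, ...)` from the lecture.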