class: title-slide

# Econ 520: Data Science for Economists
## Lecture 11: A/B Testing
<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>
<br>

<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Lecture Structure

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Main Goals

- The main goal of this lecture is to learn how to analyze data from A/B tests (RCTs).

- We will first discuss a case study.

- Then, we will flip the classroom and work on exercises in class.

- This exercise will become a problem set that is due on March 25, 2024.

---
class: center, middle
name: Causal Inference

# Potential Outcomes and Fundamental Problem of Causal Inference

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Potential Outcomes

- As we have discussed in the previous three lectures, we are now interested in analyzing the effect of a treatment (D) on outcomes (Y).

- To formulate the problem, we have adopted the potential outcomes framework:

  - `\(Y_i(0)\)` = the potential outcome for unit i if the treatment is not applied

  - `\(Y_i(1)\)` = the potential outcome for unit i if the treatment is applied

- Individual treatment effects are defined as:

  - `\(Y_i(1) - Y_i(0)\)` = the individual treatment effect for individual i

- .hi[Fundamental Problem of Causal Inference:] <br> For each `\(i\)`, we can only observe `\(Y_i(1)\)` or `\(Y_i(0)\)`, but not both.

---
# Parameters of Interest

- The fact that we cannot observe both potential outcomes for the same individual does not preclude us from formulating causal questions that involve both potential outcomes.

- However, we will recognize that we cannot estimate individual treatment effects, at least not in a realistic manner.

- Thus, we will focus on learning about average treatment effects (ATEs) and average treatment effects on the treated (ATTs).
- `\(ATE = E[Y(1)-Y(0)]\)`

- `\(ATT = E[Y(1)-Y(0)|D=1]\)`

- Now, if we cannot see `\(Y(0)\)` and `\(Y(1)\)`, how can we identify these parameters?

---
class: center, middle
name: Causal Inference

# A/B Testing

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# A/B tests

- In general, a simple comparison of means does not recover the ATE or ATT, as we have seen in the previous lectures:

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] &=& ATT + (E[Y(0)|D=1] - E[Y(0)|D=0])\\ \\ &=& ATT + Selection~Bias \end{eqnarray*}$$`

- However, in the context of A/B tests, we can recover the ATE and ATT.

- That is because treatment `\(D\)` is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

---
# A/B tests

- In general, a simple comparison of means does not recover the ATE or ATT, as we have seen in the previous lectures:

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] &=& ATT + (E[Y(0)|D=1] - E[Y(0)|D=0])\\ \\ &=& ATT + Selection~Bias \end{eqnarray*}$$`

- However, in the context of A/B tests, we can recover the ATE and ATT.

- That is because treatment `\(D\)` is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

- In this case, we have

`$$\begin{eqnarray*} E[Y|D=1] - E[Y|D=0] = ATT = ATE \end{eqnarray*}$$`

---
# A/B tests, in practice

- In order to estimate the `\(ATE\)`, we can leverage a simple regression model,

$$ Y_i = \alpha + \beta D_i + \epsilon_i,$$

and it is straightforward to show that `\(\beta = ATE\)`.

- Thus, OLS estimates of `\(\beta\)` will be unbiased and consistent for the ATE.

- We can also conduct valid inference for `\(\beta\)` using standard inference procedures from regression analysis .hi[(that you have learned in Econ 320).]

---
class: center, middle
name: Causal Inference

# A/B Testing, in practice

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Case Study from Kaggle

- The dataset we will use is from [Kaggle](https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing?)
- Marketing companies want to run successful campaigns, but the market is complex and several options can work.

- The companies are interested in answering two questions:
<div style="margin-top: -.5cm;"></div>
  - Would the campaign be successful?

  - If the campaign were successful, how much of that success could be attributed to the ads?

- So they normally turn to A/B tests:
<div style="margin-top: -.5cm;"></div>
  - two or more versions of a variable (web page, page element, banner, etc.) are shown to different people at the same time to see which performs better.

---
# Case Study from Kaggle: The A/B test design

- To check which type of ad is more successful, the company can run an A/B test.

- There are many possible designs, one of them being:

  - The majority of the people will be exposed to ads (the experimental group).

  - A small portion of people (the control group) would instead see a Public Service Announcement (PSA) (or nothing) in the exact size and place where the ad would normally appear.

- Let's look at an example of this.

---
# Data Dictionary

- The data file is located at (https://psantanna.com/Econ520/files/marketing_AB.csv)

- This data is from [Kaggle](https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing?)
- The data dictionary is as follows:

  - Index: Row index
  - user.id: User ID (unique)
  - test.group: If "ad", the person saw the advertisement; if "psa", they only saw the public service announcement
  - converted: True if the person bought the product, False otherwise
  - total.ads: Number of ads seen by the person
  - most.ads.day: Day of the week on which the person saw the most ads
  - most.ads.hour: Hour of the day at which the person saw the most ads

---
# Start the analysis

```r
# Load the necessary libraries
library(tidyverse)
library(estimatr) # library for regression with robust standard errors

# Load the data
ab_mkt <- read.csv(url("https://psantanna.com/Econ520/files/marketing_AB.csv"))

# Transform the data into a tibble
ab_mkt <- as_tibble(ab_mkt)

# Show the first rows of the data
head(ab_mkt)
```

```
## # A tibble: 6 × 7
##       X user.id test.group converted total.ads most.ads.day most.ads.hour
##   <int>   <int> <chr>      <chr>         <int> <chr>                <int>
## 1     0 1069124 ad         False           130 Monday                  20
## 2     1 1119715 ad         False            93 Tuesday                 22
## 3     2 1144181 ad         False            21 Tuesday                 18
## 4     3 1435133 ad         False           355 Tuesday                 10
## 5     4 1015700 ad         False           276 Friday                  14
## 6     5 1137664 ad         False           734 Saturday                10
```

```r
# Sample size
nrow(ab_mkt)
```

```
## [1] 588101
```

---
# Some sanity checks

```r
# Let's check if all users are unique
unique_users <- unique(ab_mkt$user.id)
length(unique_users) == nrow(ab_mkt)
```

```
## [1] TRUE
```

```r
# Check the proportion of missing data
ab_mkt %>%
  summarise_all(~sum(is.na(.))/n())
```

```
## # A tibble: 1 × 7
##       X user.id test.group converted total.ads most.ads.day most.ads.hour
##   <dbl>   <dbl>      <dbl>     <dbl>     <dbl>        <dbl>         <dbl>
## 1     0       0          0         0         0            0             0
```

---
# Some data cleaning

```r
# Create numerical variables for most.ads.day and converted
ab_mkt <- ab_mkt %>%
  mutate(converted = 1*(converted == "True"),
         most.ads.day.num = as.numeric(
           factor(most.ads.day,
                  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                             "Friday", "Saturday", "Sunday"))
         )
  )
```

---
# Some summary statistics

```r
# Check how many units are treated and control,
# together with some summary statistics
ab_mkt %>%
  group_by(test.group) %>%
  summarize(n = n(), # Number of treated and untreated units
            prop = n()/nrow(ab_mkt), # Proportion of treated and untreated units
            conversion_rate = mean(converted), # Mean conversion rate
            avg_total_ads = mean(total.ads), # Mean total ads
            med_total_ads = median(total.ads), # Median total ads
            avg_most_ads_hour = mean(most.ads.hour), # Mean most ads hour
            average_most_ads_day = mean(most.ads.day.num) # Mean most ads day
  )
```

```
## # A tibble: 2 × 8
##   test.group      n   prop conversion_rate avg_total_ads med_total_ads avg_most_ads_hour average_most_ads_day
##   <chr>       <int>  <dbl>           <dbl>         <dbl>         <dbl>             <dbl>                <dbl>
## 1 ad         564577 0.960           0.0255          24.8            13              14.5                 4.03
## 2 psa         23524 0.0400          0.0179          24.8            12              14.3                 3.95
```

---
# Check if the effect of the campaign is statistically significant

```r
# Estimate the ATE
lm_ate <- lm_robust(converted ~ test.group, data = ab_mkt)
summary(lm_ate)
```

```
## 
## Call:
## lm_robust(formula = converted ~ test.group, data = ab_mkt)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper     DF
## (Intercept)    0.025547  0.0002100 121.660 0.000e+00  0.025135  0.025958 588099
## test.grouppsa -0.007692  0.0008886  -8.657 4.848e-18 -0.009434 -0.005951 588099
## 
## Multiple R-squared:  9.236e-05 , Adjusted R-squared:  9.066e-05 
## F-statistic: 74.95 on 1 and 588099 DF,  p-value: < 2.2e-16
```

```r
# Notice that the treatment here is the omission of the campaign,
# so things are ``reversed''
```

---
# Change the treatment variable

```r
# Treatment is the ad
ab_mkt <- ab_mkt %>%
  mutate(treated = ifelse(test.group == "ad", 1, 0))

# Estimate the ATE
lm_ate <- lm_robust(converted ~ treated, data = ab_mkt)
summary(lm_ate)
```

```
## 
## Call:
## lm_robust(formula = converted ~ treated, data = ab_mkt)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper     DF
## (Intercept) 0.017854  0.0008634  20.679 5.801e-95 0.016162 0.019546 588099
## treated     0.007692  0.0008886   8.657 4.848e-18 0.005951 0.009434 588099
## 
## Multiple R-squared:  9.236e-05 , Adjusted R-squared:  9.066e-05 
## F-statistic: 74.95 on 1 and 588099 DF,  p-value: < 2.2e-16
```

---
# Let's check if the treatment also has an impact on total ads

```r
# Check if the treatment and control groups are balanced with respect to total ads
lm_ads <- lm_robust(total.ads ~ treated, data = ab_mkt)
lm_ads
```

```
##                Estimate Std. Error    t value  Pr(>|t|)   CI Lower   CI Upper     DF
## (Intercept) 24.76113756  0.2794498 88.6067315 0.0000000 24.2134248 25.3088503 588099
## treated      0.06222754  0.2854515  0.2179969 0.8274316 -0.4972482  0.6217033 588099
```

What are the possible mechanisms that could justify these results?

---
# What about the time of the day and day of the week?

```r
lm_most_ads_hour <- lm_robust(most.ads.hour ~ treated, data = ab_mkt)
lm_most_ads_hour
```

```
##              Estimate Std. Error    t value    Pr(>|t|)   CI Lower   CI Upper     DF
## (Intercept) 14.304923 0.03035845 471.200628 0.00000e+00 14.2454210 14.3644242 588099
## treated      0.170977 0.03103480   5.509203 3.60613e-08  0.1101498  0.2318042 588099
```

```r
lm_most_ads_day <- lm_robust(most.ads.day.num ~ treated, data = ab_mkt)
lm_most_ads_day
```

```
##               Estimate Std. Error    t value     Pr(>|t|)   CI Lower  CI Upper     DF
## (Intercept) 3.95264411 0.01270701 311.060029 0.000000e+00 3.92773877 3.9775494 588099
## treated     0.07592595 0.01298450   5.847428 4.994931e-09 0.05047674 0.1013752 588099
```

But do these regressions make (economic) sense? .hi[I think not!]

---
class: center, middle
name: Causal Inference

# Exercise and Problem Set

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Problem Description

- Now, let's move to a more involved exercise, one in which you will need to check a few more things.
- In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others.

- These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI).

- In these experiments, UI claimants were randomly assigned either to a control group or to one of five treatment groups.

---
# Problem Description

- Actually, there are six treatment groups in the experiments, but following Bilias (2000) we merge groups 4 and 6.

- In the control group, the current UI rules applied.

- Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (the qualification period), provided that the job was retained for a specified duration.

- The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period; see (http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt) for further details on the data.

---
# Part I: some data management

- The data for the analysis are described [here](http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/), and you can download them from (https://psantanna.com/Econ520/files/penn_jae.dat)

- You should load the data into R and check the structure of the data.

- You should merge treatment groups 4 and 6 into a single treatment group.

- Once that is done, keep only observations from the control group and the merged treatment group.

- How many observations are treated and how many are in the control group? What are the relative sizes of these groups?

- Are there missing data in the dataset? If so, how many?
For which variables?

---
# Part II: Analyzing the data

- We are particularly interested in assessing the effect of the treatment on the log of the unemployment duration, so you should create a new variable `log_duration` that is the natural logarithm of the variable `inuidur1`.

- Provide some summary statistics by treatment group for the following variables: `inuidur1`, `log_duration`, `female`, `black`, `othrace`, `agelt35`, and `agegt54`. Do the treatment and control groups differ in terms of these variables?

- Estimate the `\(ATE\)` of the treatment on the log of the unemployment duration using a simple linear regression model, and interpret the results.

  - What is the point estimate?

  - Is it statistically significant?

  - What is the 95% confidence interval?

---
# Part III: Heterogeneity Analysis

- Do the `\(ATE\)` estimates vary by race (white, black, or other race)?

- Do the `\(ATE\)` estimates vary by age group (less than 35 years, between 35 and 54 years, and more than 54 years)?

- Do the `\(ATE\)` estimates vary by gender and race?

.hi[To answer all these questions, you need not only to compute the group-specific `\(ATE\)`s, but also to assess how they vary across groups and whether such variation is statistically significant.]

---
# Part IV: Conclusion

- What have you learned in this exercise?

- What would be the next steps in this analysis?

- What are the limitations of the analysis?

- Summarize the main findings in two pages, with the most useful insights. Include the most important plots and tables.

  - Think of this as an executive summary.

- Write this in a Markdown file and export it as a .pdf or .html file in the `docs` folder in your project.
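---
# Appendix: a starter sketch for Part I

- Below is a minimal, hedged sketch of the data-management steps in Part I. The treatment-group column name (`tg`, with 0 = control) is taken from the Bilias readme and is an assumption here; verify it against the data description before relying on it. A toy data frame stands in for the real file so that the snippet is self-contained.

```r
# Load dplyr for the data-management verbs used below
library(dplyr)

# Toy stand-in for the real data; for the problem set, load the real file, e.g.:
# penn <- read.table("https://psantanna.com/Econ520/files/penn_jae.dat", header = TRUE)
penn <- data.frame(tg       = c(0, 1, 4, 6, 0, 4),   # assumed treatment-group column
                   inuidur1 = c(10, 5, 8, 3, 12, 6)) # unemployment duration

penn_clean <- penn %>%
  mutate(tg = ifelse(tg == 6, 4, tg)) %>%  # merge treatment groups 4 and 6
  filter(tg %in% c(0, 4)) %>%              # keep control + merged treatment group
  mutate(treated = 1 * (tg == 4),          # treatment indicator
         log_duration = log(inuidur1))     # outcome used in Part II

# Group sizes and relative proportions
penn_clean %>%
  count(treated) %>%
  mutate(prop = n / sum(n))
```

- With the real data, the same pipeline feeds directly into `lm_robust(log_duration ~ treated, ...)` from the lecture.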