class: title-slide

# Econ 520: Data Science for Economists
## Lecture 13: Linear Regression Adjustments under Unconfoundedness: Estimation

<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>
<br>
<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Main Goal

<style type="text/css">
.small-code .remark-code{
  font-size: 40%
}
.medium-code .remark-code{
  font-size: 55%
}
</style>

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Main Goals

- The main goal of this lecture is to discuss how we can use linear regression adjustments for causal inference under unconfoundedness.

- We will cover its principles, in theory and in practice.

- We will illustrate the main ideas in a simulated exercise.

- Introduce Monte Carlo simulations.

- Form the basis for a problem set due on April 4th.

---
class: center, middle
name: CI

# Causal Inference under Unconfoundedness

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Causal Inference without an Experiment

- We have discussed in Slide decks 10 and 11 that, with A/B testing, we can estimate causal effects by using simple comparisons of means,

`$$\begin{eqnarray*}
E[Y|D=1] - E[Y|D=0] = ATT = ATE.
\end{eqnarray*}$$`

- That follows because the treatment is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

- In many practical situations, however, we do not have the luxury of an experiment.

- How can we uncover average treatment effects in these cases?

---
# Unconfoundedness

- The usual starting point to tackle this problem is the assumption of .hi[unconfoundedness]

`$$\begin{eqnarray*}
Y(1), Y(0) \perp D|X.
\end{eqnarray*}$$`

- This assumption is also known as .hi[selection on observables].

- Here, `\(X\)` is a set of observed covariates that affect both the treatment assignment and the potential outcomes.

- We usually refer to `\(X\)` as confounders.
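---
# Why adjust? A quick numerical illustration

- Before the formal results, here is a minimal sketch of why a confounder breaks the naive comparison of means. (This toy DGP is illustrative only and is not part of the lecture's main simulation.)

```r
# Toy example: one confounder x raises both treatment take-up and the
# outcome, while the true treatment effect is exactly zero
set.seed(1)
n <- 1e5
x <- rnorm(n)
d <- 1*(runif(n) <= plogis(x))  # treatment is more likely when x is high
y <- 2*x + rnorm(n)             # outcome depends on x only, not on d
# Naive difference in means: badly biased away from the true effect of 0
mean(y[d==1]) - mean(y[d==0])
# Adjusting for x (here, a correctly specified linear model) removes the
# bias: the coefficient on d is close to 0
coef(lm(y ~ d + x))["d"]
```

- This is exactly the logic we formalize next: conditioning on the confounders restores comparability between treated and untreated units.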
---
# Unconfoundedness + Overlap

- We will also impose the .hi[overlap] assumption, i.e., for some `\(0<\varepsilon<0.5\)`, `\(\varepsilon < P(D=1|X) < 1-\varepsilon\)`.

- This implies that every unit, regardless of its value of `\(X\)`, has a positive probability of receiving either treatment or control.

- It is also known as the .hi[common support] assumption.

- This is important because it ensures that we can estimate the treatment effect for all units in the population.

- This assumption rules out deterministic treatment assignment.

- .hi[Why is this important?]

---
# How does unconfoundedness + overlap help?

- The unconfoundedness + overlap assumptions imply that we can write the conditional expectation of the potential outcomes as

`$$\begin{eqnarray*}
E[Y(1)|X] = E[Y(1)|D=1,X] = E[Y|D=1,X].
\end{eqnarray*}$$`

- Analogously, we have that

`$$\begin{eqnarray*}
E[Y(0)|X] = E[Y(0)|D=0,X] = E[Y|D=0,X].
\end{eqnarray*}$$`

- Thus, we can identify the conditional ATE (CATE) given `\(X\)` as

`$$\begin{eqnarray*}
CATE(X) &\equiv& E[Y(1)- Y(0)|X] \\
&=& E[Y(1)|X] - E[Y(0)|X] \\
&=& E[Y|D=1,X] - E[Y|D=0,X].
\end{eqnarray*}$$`

---
# Q1 of Problem Set 3: CATE(X) vs CATT(X)

- The conditional average treatment effect on the treated (CATT) given `\(X\)` is defined as

`$$\begin{eqnarray*}
CATT(X) &\equiv& E[Y(1)- Y(0)|D=1,X]
\end{eqnarray*}$$`

- .hi[Question:] Under unconfoundedness + overlap, is the `\(CATE(X)\)` different from the `\(CATT(X)\)`?

- .hi[Prove mathematically whether or not this is the case.] (Question 1 of PS 3).

---
# From CATE(X) to ATE and ATT

- The CATE(X) is the average treatment effect for a given value of `\(X\)`.

- This is a functional parameter: it can take a different value for each value of `\(X\)`.

- .hi[Why is this important?]

- With A/B tests, however, our main interest was not in the CATE(X), but in the ATE and ATT.

- .hi[How can we go from the CATE(X) to the ATE and ATT?]
- We can integrate the CATE(X) over the distribution of `\(X\)` to get the ATE.

- We can average the CATE(X) over the distribution of `\(X\)` among the treated to get the ATT.

---
# From CATE(X) to ATE

- Let's show this in math terms!

- Given our previous results and the Law of Iterated Expectations, we have that

`$$\begin{eqnarray*}
ATE &\equiv& E[Y(1)- Y(0)] \\
&=& E[E[Y(1)- Y(0)|X]] \\
&=& E[CATE(X)]\\
&=& E[E[Y|D=1,X] - E[Y|D=0,X]]\\
&=& E[E[Y|D=1,X]] - E[E[Y|D=0,X]]
\end{eqnarray*}$$`

---
# From CATE(X) to ATT

- .hi[Q2 of Problem Set 3: From CATE(X) to ATT]

- Prove, stating and justifying each step of your calculations, that, under unconfoundedness + overlap, the ATT can be written as

`$$\begin{eqnarray*}
ATT &\equiv& E[Y(1)- Y(0)|D=1] = E[Y|D=1] - E[E[Y|D=0,X]|D=1].
\end{eqnarray*}$$`

---
# Plug-in estimators

- We can estimate the ATE and ATT by plugging in estimates of the conditional expectations.

- To do that, all we need is to come up with a model for `\(E[Y|D=1,X]\)` and `\(E[Y|D=0,X]\)`.

- But .hi[Econ 320] has already taught you how to do that, at least to start the game!

- We can use linear regression to estimate these conditional expectations.

---
# Plug-in estimators

- Among treated units, we can pose a model for `\(E[Y|D=1,X]\)` as

`$$\begin{eqnarray*}
Y_i &=& \beta_0^{D=1} + \beta_1^{D=1} X_{1,i} + \beta_2^{D=1} X_{2,i} + \dots +\beta_k^{D=1} X_{k,i} + \varepsilon^{D=1}_{i} \\
&=& X_i'\beta^{D=1} + \varepsilon^{D=1}_i.
\end{eqnarray*}$$`

- Among untreated units, we can pose a model for `\(E[Y|D=0,X]\)` as

`$$\begin{eqnarray*}
Y_i &=& \beta_0^{D=0} + \beta_1^{D=0} X_{1,i} + \beta_2^{D=0} X_{2,i} + \dots +\beta_k^{D=0} X_{k,i} + \varepsilon^{D=0}_{i} \\
&=& X_i'\beta^{D=0} + \varepsilon^{D=0}_i.
\end{eqnarray*}$$`

- If we believe these models, we have that

`$$\begin{eqnarray*}
{CATE}(X) = X'\beta^{D=1} - X'\beta^{D=0} = X'(\beta^{D=1} - \beta^{D=0}).
\end{eqnarray*}$$`

---
# Plug-in estimators

- The issue is that we don't know the true values of `\(\beta^{D=1}\)` and `\(\beta^{D=0}\)`.

- But we can estimate them by running a linear regression of `\(Y\)` on `\(X\)` among treated and untreated units, respectively.

- Once we have estimates of `\(\beta^{D=1}\)` and `\(\beta^{D=0}\)`, we can plug them into the formulas for the ATE and ATT to get the plug-in estimators of these parameters as

`$$\begin{eqnarray*}
\widehat{ATE} &=& E_n[X]'~(\widehat{\beta}^{D=1} - \widehat{\beta}^{D=0}) , \\
\widehat{ATT} &=& E_n[X|D=1]'~(\widehat{\beta}^{D=1} - \widehat{\beta}^{D=0}),
\end{eqnarray*}$$`

where `\(E_n[X]\)` and `\(E_n[X|D=1]\)` are the sample averages of the `\((k+1)\)`-dimensional vector `\(X\)` among all units and among treated units, respectively.

---
# Let's illustrate these points using simulated data

- To run the simulations *in parallel*, we will use the following packages:

  - doRNG: to set the seed for reproducibility
  - doSNOW: to run the simulations in parallel
  - foreach: to run the simulations in parallel with a nice interface
  - estimatr: to estimate the ATE and ATT using the plug-in estimators

- A convenient way to install (if necessary) and load everything is to run the code chunk below.
```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(estimatr, tidyverse, foreach, doRNG, doSNOW, parallel, ggplot2)
```

---
# Let's illustrate these points using simulated data

```r
# Set the seed for reproducibility
set.seed(20240325)
# Number of observations
n_obs <- 1000
# Generate confounders
x1 <- rnorm(n_obs, mean = 0, sd = 1)
x2 <- rnorm(n_obs, mean = 0, sd = 2)
x3 <- rpois(n_obs, lambda = 1)
x4 <- rbinom(n_obs, size = 1, prob = 0.2)
# Put them all in a matrix
x <- cbind(1, x1, x2, x3, x4)
# Generate potential outcomes as linear functions of the confounders
beta_Y0 <- c(1, 2, 3, 4, 5)
beta_Y1 <- c(2, 3, 4, 5, 5)
epsilon_Y0 <- rnorm(n_obs, mean = 0, sd = 1)
epsilon_Y1 <- rnorm(n_obs, mean = 0, sd = 1)
y1 <- x %*% beta_Y1 + epsilon_Y1
y0 <- x %*% beta_Y0 + epsilon_Y0
```

---
# Let's illustrate these points using simulated data

```r
# True ATE is (beta_Y1 - beta_Y0) %*% population mean of X
true_ATE <- as.numeric((beta_Y1 - beta_Y0) %*% c(1, 0, 0, 1, 0.2))
# Generate the treatment indicator
beta_pscore <- c(1, 1, -1, 1, -1)/3
pscore <- plogis(x %*% beta_pscore)
d <- 1*(runif(n_obs) <= pscore)
# Generate the observed outcomes
y <- y1*d + y0*(1-d)
# Put everything in a data frame
df <- data.frame(y = y, d = d, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
```

---
# Let's get it going: Estimating E[Y|D=1, X]

```r
# First, let's estimate the parameters associated with the treatment group
reg_treated <- lm_robust(y ~ x1 + x2 + x3 + x4, data = df %>% filter(d==1))
reg_treated
```

```
##             Estimate Std. Error   t value      Pr(>|t|) CI Lower CI Upper  DF
## (Intercept) 1.976888 0.06547941  30.19098 4.697820e-124 1.848301 2.105475 623
## x1          3.010033 0.03930404  76.58329 3.442452e-319 2.932848 3.087217 623
## x2          4.001553 0.02209097 181.13975  0.000000e+00 3.958172 4.044935 623
## x3          5.038368 0.03980441 126.57813  0.000000e+00 4.960201 5.116535 623
## x4          4.959727 0.11145127  44.50131 1.280791e-195 4.740862 5.178593 623
```

```r
# Compare estimated coefficients with the true ones
# (we only know these because we generated the data)
beta_Y1
```

```
## [1] 2 3 4 5 5
```

---
# Let's get it going: Estimating E[Y|D=0, X]

```r
# Next, let's estimate the parameters associated with the comparison group
reg_untreated <- lm_robust(y ~ x1 + x2 + x3 + x4, data = df %>% filter(d==0))
reg_untreated
```

```
##              Estimate Std. Error   t value      Pr(>|t|)  CI Lower CI Upper  DF
## (Intercept) 0.9756212 0.07611523  12.81769  2.448938e-31 0.8259445 1.125298 367
## x1          2.0165204 0.05199279  38.78461 7.024609e-132 1.9142792 2.118762 367
## x2          3.0052323 0.02334023 128.75763 1.616802e-307 2.9593349 3.051130 367
## x3          3.9653569 0.05144444  77.08038 9.193670e-229 3.8641941 4.066520 367
## x4          5.0606895 0.11646295  43.45322 9.226814e-147 4.8316711 5.289708 367
```

```r
# Compare estimated coefficients with the true ones
# (we only know these because we generated the data)
beta_Y0
```

```
## [1] 1 2 3 4 5
```

---
# Let's get it going: Estimating ATE

```r
# Now, let's actually compute the ATE the plug-in way
# First, get the CATE estimates
y1_hat <- predict(reg_treated, newdata = df)
y0_hat <- predict(reg_untreated, newdata = df)
cate_hat <- y1_hat - y0_hat
# Now, compute the ATE
ate_hat <- mean(cate_hat)
ate_hat
```

```
## [1] 2.048136
```

```r
# Compare with the true, unfeasible ATE
true_ATE
```

```
## [1] 2
```

---
# Let's get it going: comparing

```r
# What if we had run a single regression?
reg_single <- lm_robust(y ~ d + x1 + x2 + x3 + x4, data = df)
reg_single
```

```
##                 Estimate Std. Error      t value      Pr(>|t|)   CI Lower  CI Upper  DF
## (Intercept) -0.007034925 0.10868474  -0.06472781  9.484037e-01 -0.2203128 0.2062429 994
## d            2.171144761 0.12105559  17.93510553  1.559495e-62  1.9335909 2.4086986 994
## x1           2.639992945 0.05387442  49.00271515 2.162323e-267  2.5342723 2.7457136 994
## x2           3.576158143 0.03325877 107.52526007  0.000000e+00  3.5108927 3.6414236 994
## x3           4.676836727 0.05873254  79.62939474  0.000000e+00  4.5615827 4.7920907 994
## x4           5.168677536 0.13981587  36.96774490 6.747058e-189  4.8943094 5.4430457 994
```

```r
# Notice that the coefficient on d is not statistically different from 2, though
(reg_single$coefficients["d"] - 2)/reg_single$std.error["d"]
```

```
##       d 
## 1.41377
```

---
# Let's get it going: comparing

```r
# What if we were to run a simple regression?
reg_simple <- lm_robust(y ~ d, data = df)
reg_simple
```

```
##               Estimate Std. Error   t value     Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept)  7.6095307  0.4232595 17.978406 8.048498e-63  6.778950 8.4401113 998
## d           -0.9474961  0.5907214 -1.603964 1.090383e-01 -2.106695 0.2117024 998
```

---
# What about the ATT?

- In the above example, the true ATE was easy to compute because it only depends on the `\(CATE(X)\)` and the average of the `\(X\)`'s,

`$$\begin{eqnarray*}
ATE = E[X]'(\beta^{D=1} - \beta^{D=0}).
\end{eqnarray*}$$`

- The true ATT, however, is not as straightforward to compute because it depends on the distribution of `\(X\)` among the treated units only.

- And we do not have that distribution explicitly in the DGP!

- Note that this difficulty concerns the .hi[TRUE] ATT only!

- To estimate the ATT, though, things are very simple and build on the previous steps, as we show next.

---
# Let's get it going: Estimating ATT

```r
# Now, let's actually compute the ATT the plug-in way
# It just averages the cate_hat over the treated units
att_hat <- mean(cate_hat[df$d==1])
att_hat
```

```
## [1] 1.761307
```

```r
# But how do I know if this is "close" to the true one?
```

---
# Using simulations to approximate the ATT

- We can use simulations to approximate the true ATT.

- We can do this by simulating the data many times and computing the ATT each time.

- We can then compare the average of these ATTs with the ATT we estimated using the plug-in method.

- This is a Monte Carlo simulation.

- For the purpose of learning about the true ATT, I recommend picking a large sample size.

- Let's do it!

---
# Let's get it going: Monte Carlo Simulation

```r
# Let's simulate the data many times and compute the ATT each time
n_sim <- 1000   # 1k simulations
n_obs <- 1e6    # 1M observations
ncores <- 20    # Number of cores to use
# Make cluster to run in parallel
cl <- parallel::makeCluster(ncores)
registerDoSNOW(cl)
# Set seed (guaranteed reproducibility)
seed1 <- 20240326
set.seed(seed1)
```

---
# Let's get it going: Monte Carlo Simulation

```r
sims_att <- foreach(nn = 1:n_sim) %dorng% {
  # Generate confounders
  x1 <- rnorm(n_obs, mean = 0, sd = 1)
  x2 <- rnorm(n_obs, mean = 0, sd = 2)
  x3 <- rpois(n_obs, lambda = 1)
  x4 <- rbinom(n_obs, size = 1, prob = 0.2)
  # Put them all in a matrix
  x <- cbind(1, x1, x2, x3, x4)
  # Generate potential outcomes as linear functions of the confounders
  beta_Y0 <- c(1, 2, 3, 4, 5)
  beta_Y1 <- c(2, 3, 4, 5, 5)
  epsilon_Y0 <- rnorm(n_obs, mean = 0, sd = 1)
  epsilon_Y1 <- rnorm(n_obs, mean = 0, sd = 1)
  y1 <- x %*% beta_Y1 + epsilon_Y1
  y0 <- x %*% beta_Y0 + epsilon_Y0
  # Generate the treatment indicator
  beta_pscore <- c(1, 1, -1, 1, -1)/3
  pscore <- plogis(x %*% beta_pscore)
  d <- 1*(runif(n_obs) <= pscore)
  # Compute ATT (for the sample)
  att <- mean(y1[d==1] - y0[d==1])
  return(att)
}
```

---
# Let's get it going: Monte Carlo Simulation

```r
#-----------------------------------------------------------------------------
# Stop the cluster
stopCluster(cl)
```

```r
# Compute average of the Monte Carlo sims
mean(unlist(sims_att))
```

```
## [1] 1.778139
```

```r
# Standard deviation of the Monte Carlo sims
sd(unlist(sims_att))
```

```
## [1] 0.003585003
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Set some parameters
n_sim <- 1000    # 1k simulations
n_obs <- 10000   # 10k observations
ncores <- 20     # Number of cores to use
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Create a function to simulate the data
simulate_data <- function(n_obs){
  # Generate confounders
  x1 <- rnorm(n_obs, mean = 0, sd = 1); x2 <- rnorm(n_obs, mean = 0, sd = 2)
  x3 <- rpois(n_obs, lambda = 1); x4 <- rbinom(n_obs, size = 1, prob = 0.2)
  # Put them all in a matrix
  x <- cbind(1, x1, x2, x3, x4)
  # Generate potential outcomes as linear functions of the confounders
  beta_Y0 <- c(1, 2, 3, 4, 5); beta_Y1 <- c(2, 3, 4, 5, 5)
  epsilon_Y0 <- rnorm(n_obs); epsilon_Y1 <- rnorm(n_obs)
  y1 <- x %*% beta_Y1 + epsilon_Y1
  y0 <- x %*% beta_Y0 + epsilon_Y0
  # Generate the treatment indicator
  beta_pscore <- c(1, 1, -1, 1, -1)/3
  pscore <- plogis(x %*% beta_pscore)
  d <- 1*(runif(n_obs) <= pscore)
  # Observed outcome
  y <- y1*d + y0*(1-d)
  df <- data.frame(y = y, d = d, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
  return(df)
}
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Create a function to compute the ATE and ATT
ate_att_estimate <- function(df1){
  reg_treated <- estimatr::lm_robust(y ~ x1 + x2 + x3 + x4, data = df1 %>% filter(d==1))
  reg_untreated <- estimatr::lm_robust(y ~ x1 + x2 + x3 + x4, data = df1 %>% filter(d==0))
  y1_hat <- predict(reg_treated, newdata = df1)
  y0_hat <- predict(reg_untreated, newdata = df1)
  cate_hat <- y1_hat - y0_hat
  ate_hat <- as.numeric(mean(cate_hat))
  att_hat <- as.numeric(mean(cate_hat[df1$d==1]))
  out <- cbind(ate_hat, att_hat)
  return(out)
}
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Make cluster to run in parallel
cl <- parallel::makeCluster(ncores)
registerDoSNOW(cl)
# Set seed (guaranteed reproducibility)
seed1 <- 20240326
set.seed(seed1)

sims_eval <- foreach(nn = 1:n_sim, .packages = c("estimatr", "tidyverse")) %dorng% {
  df <- simulate_data(n_obs)
  out <- ate_att_estimate(df)
  return(out)
}
#-----------------------------------------------------------------------------
# Stop the cluster
stopCluster(cl)
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Put the Monte Carlo results in a matrix
sims_eval1 <- array(unlist(sims_eval),
                    dim = c(nrow(sims_eval[[1]]), ncol(sims_eval[[1]]), length(sims_eval)))
sims_eval1 <- t(matrix(sims_eval1, 2, n_sim)); colnames(sims_eval1) <- c("ate", "att")
# Compute average of the Monte Carlo sims
colMeans(sims_eval1)
```

```
##      ate      att 
## 1.999320 1.776301
```

```r
# Compute bias (using the Monte Carlo approximation 1.778 for the true ATT)
colMeans(sims_eval1) - c(2, 1.778)
```

```
##           ate           att 
## -0.0006804398 -0.0016986254
```

```r
# Root mean squared error
sqrt(colMeans((sims_eval1 - matrix(c(2, 1.778), nrow = n_sim, ncol = 2))^2))
```

```
##       ate       att 
## 0.1596595 0.1626643
```

---
# Some Plots

```r
# Theme of the plots
p.style <- theme(
  panel.background = element_rect(fill = "transparent"),              # bg of the panel
  plot.background = element_rect(fill = "transparent", color = NA),   # bg of the plot
  panel.grid.major = element_blank(),                                 # get rid of major grid
  panel.grid.minor = element_blank(),                                 # get rid of minor grid
  legend.background = element_rect(fill = "transparent"),             # get rid of legend bg
  legend.box.background = element_rect(fill = "transparent"),         # get rid of legend panel bg
  #legend.title = element_blank(),
  legend.title = element_text(size = 9),
  legend.text = element_text(size = 8),
  legend.key.size = unit(.5, "cm"))
```

---
# Some Plots

```r
# Plot ATE
p1 <- ggplot(data.frame(sims_eval1), aes(x = ate)) +
  geom_density(color = "black", fill = "#ed6464", alpha = .9) +
  geom_vline(xintercept = 2) +
  p.style +
  labs(x = "Regression Adjusted Estimator for ATE", y = "Density")
```

---
# ATE

```r
p1
```

<img src="13slides_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Some Plots

```r
# Plot ATT
p2 <- ggplot(data.frame(sims_eval1), aes(x = att)) +
  geom_density(color = "black", fill = "#fff7bc", alpha = .9) +
  geom_vline(xintercept = 1.778) +
  p.style +
  labs(x = "Regression Adjusted Estimator for the ATT", y = "Density")
```

---
# ATT

```r
p2
```

<img src="13slides_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
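---
# Wrapping up: a reusable Monte Carlo summary

- The bias and RMSE computations above can be wrapped in a small helper for future problem sets. A minimal sketch (the function `mc_summary` is illustrative and not part of the lecture code):

```r
# Summarize a vector of Monte Carlo estimates against a known truth
mc_summary <- function(estimates, truth){
  c(bias = mean(estimates) - truth,
    sd   = sd(estimates),
    rmse = sqrt(mean((estimates - truth)^2)))
}

# Example with artificial draws centered at a truth of 2
set.seed(42)
draws <- rnorm(1000, mean = 2, sd = 0.16)
round(mc_summary(draws, truth = 2), 3)
```

- For an (approximately) unbiased estimator, the RMSE is close to the Monte Carlo standard deviation, which is what we saw for both the ATE and ATT estimators above.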