class: title-slide

# Econ 520: Data Science for Economists
## Lecture 13: Linear Regression Adjustments under Unconfoundedness: Estimation

<br>

<p align=center> Pedro H. C. Sant'Anna </p>
<div style="margin-top: -.7cm;"></div>
<p align=center> Emory University </p>
<br>
<p align=center> Spring 2024 </p>

---
class: center, middle
name: prologue

# Main Goal

<style type="text/css">
.small-code .remark-code{
  font-size: 40%
}
.medium-code .remark-code{
  font-size: 55%
}
</style>

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Main Goals

- The main goal of this lecture is to discuss how we can use linear regression adjustments for causal inference under unconfoundedness.

- We will cover its principles, in theory and in practice.

- We will illustrate the main ideas in a simulated exercise.

- Introduce Monte Carlo simulations.

- Form the basis for a problem set due on April 4th.

---
class: center, middle
name: CI

# Causal Inference under Unconfoundedness

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1100px></html>

---
# Causal Inference without an Experiment

- We have discussed in Slide decks 10 and 11 that, with A/B testing, we can estimate causal effects by using simple comparisons of means,

`$$\begin{eqnarray*}
E[Y|D=1] - E[Y|D=0] = ATT = ATE.
\end{eqnarray*}$$`

- That follows because the treatment is randomly assigned, i.e., `\(D \perp Y(0), Y(1)\)`.

- In many practical situations, however, we do not have the luxury of an experiment.

- How can we uncover average treatment effects in these cases?

---
# Unconfoundedness

- The usual starting point to tackle this problem is the assumption of .hi[unconfoundedness]

`$$\begin{eqnarray*}
Y(1), Y(0) \perp D|X.
\end{eqnarray*}$$`

- This assumption is also known as .hi[selection on observables].

- Here, `\(X\)` is a set of observed covariates that affect both the treatment assignment and the potential outcomes.

- We usually refer to `\(X\)` as confounders.
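---
# Why adjust? A quick numerical illustration

- Before the formal results, here is a minimal sketch of why a confounder breaks the naive comparison of means. (This toy DGP is illustrative only and is not part of the lecture's main simulation.)

```r
# Toy example: one confounder x raises both treatment take-up and the
# outcome, while the true treatment effect is exactly zero
set.seed(1)
n <- 1e5
x <- rnorm(n)
d <- 1*(runif(n) <= plogis(x))  # treatment is more likely when x is high
y <- 2*x + rnorm(n)             # outcome depends on x only, not on d
# Naive difference in means: badly biased away from the true effect of 0
mean(y[d==1]) - mean(y[d==0])
# Adjusting for x (here, a correctly specified linear model) removes the
# bias: the coefficient on d is close to 0
coef(lm(y ~ d + x))["d"]
```

- This is exactly the logic we formalize next: conditioning on the confounders restores comparability between treated and untreated units.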
---
# Unconfoundedness + Overlap

- We will also impose the .hi[overlap] assumption, i.e., for some `\(0<\varepsilon<0.5\)`, `\(\varepsilon < P(D=1|X) < 1-\varepsilon\)`.

- This implies that every unit, regardless of its value of `\(X\)`, has a positive probability of receiving either treatment or control.

- It is also known as the .hi[common support] assumption.

- This is important because it ensures that we can estimate the treatment effect for all units in the population.

- This assumption rules out deterministic treatment assignment.

- .hi[Why is this important?]

---
# How does unconfoundedness + overlap help?

- The unconfoundedness + overlap assumptions imply that we can write the conditional expectation of the potential outcomes as

`$$\begin{eqnarray*}
E[Y(1)|X] = E[Y(1)|D=1,X] = E[Y|D=1,X].
\end{eqnarray*}$$`

- Analogously, we have that

`$$\begin{eqnarray*}
E[Y(0)|X] = E[Y(0)|D=0,X] = E[Y|D=0,X].
\end{eqnarray*}$$`

- Thus, we can identify the conditional ATE (CATE) given `\(X\)` as

`$$\begin{eqnarray*}
CATE(X) &\equiv& E[Y(1)- Y(0)|X] \\
&=& E[Y(1)|X] - E[Y(0)|X] \\
&=& E[Y|D=1,X] - E[Y|D=0,X].
\end{eqnarray*}$$`

---
# Q1 of Problem Set 3: CATE(X) vs CATT(X)

- The conditional average treatment effect on the treated (CATT) given `\(X\)` is defined as

`$$\begin{eqnarray*}
CATT(X) &\equiv& E[Y(1)- Y(0)|D=1,X]
\end{eqnarray*}$$`

- .hi[Question:] Under unconfoundedness + overlap, is the `\(CATE(X)\)` different from the `\(CATT(X)\)`?

- .hi[Prove mathematically whether or not this is the case.] (Question 1 of PS 3).

---
# From CATE(X) to ATE and ATT

- The CATE(X) is the average treatment effect for a given value of `\(X\)`.

- This is a functional parameter: it can take a different value for each value of `\(X\)`.

- .hi[Why is this important?]

- With A/B tests, however, our main interest was not in the CATE(X), but in the ATE and ATT.

- .hi[How can we go from the CATE(X) to the ATE and ATT?]
- We can integrate the CATE(X) over the distribution of `\(X\)` to get the ATE.

- We can average the CATE(X) over the distribution of `\(X\)` among the treated to get the ATT.

---
# From CATE(X) to ATE

- Let's show this in math terms!

- Given our previous results and the Law of Iterated Expectations, we have that

`$$\begin{eqnarray*}
ATE &\equiv& E[Y(1)- Y(0)] \\
&=& E[E[Y(1)- Y(0)|X]] \\
&=& E[CATE(X)]\\
&=& E[E[Y|D=1,X] - E[Y|D=0,X]]\\
&=& E[E[Y|D=1,X]] - E[E[Y|D=0,X]]
\end{eqnarray*}$$`

---
# From CATE(X) to ATT

- .hi[Q2 of Problem Set 3: From CATE(X) to ATT]

- Prove, stating and justifying each step of your calculations, that, under unconfoundedness + overlap, the ATT can be written as

`$$\begin{eqnarray*}
ATT &\equiv& E[Y(1)- Y(0)|D=1] = E[Y|D=1] - E[E[Y|D=0,X]|D=1].
\end{eqnarray*}$$`

---
# Plug-in estimators

- We can estimate the ATE and ATT by plugging in estimates of the conditional expectations.

- To do that, all we need is to come up with a model for `\(E[Y|D=1,X]\)` and `\(E[Y|D=0,X]\)`.

- But .hi[Econ 320] has already taught you how to do that, at least to start the game!

- We can use linear regression to estimate these conditional expectations.

---
# Plug-in estimators

- Among treated units, we can pose a model for `\(E[Y|D=1,X]\)` as

`$$\begin{eqnarray*}
Y_i &=& \beta_0^{D=1} + \beta_1^{D=1} X_{1,i} + \beta_2^{D=1} X_{2,i} + \dots +\beta_k^{D=1} X_{k,i} + \varepsilon^{D=1}_{i} \\
&=& X_i'\beta^{D=1} + \varepsilon^{D=1}_i.
\end{eqnarray*}$$`

- Among untreated units, we can pose a model for `\(E[Y|D=0,X]\)` as

`$$\begin{eqnarray*}
Y_i &=& \beta_0^{D=0} + \beta_1^{D=0} X_{1,i} + \beta_2^{D=0} X_{2,i} + \dots +\beta_k^{D=0} X_{k,i} + \varepsilon^{D=0}_{i} \\
&=& X_i'\beta^{D=0} + \varepsilon^{D=0}_i.
\end{eqnarray*}$$`

- If we believe these models, we have that

`$$\begin{eqnarray*}
{CATE}(X) = X'\beta^{D=1} - X'\beta^{D=0} = X'(\beta^{D=1} - \beta^{D=0}).
\end{eqnarray*}$$`

---
# Plug-in estimators

- The issue is that we don't know the true values of `\(\beta^{D=1}\)` and `\(\beta^{D=0}\)`.

- But we can estimate them by running a linear regression of `\(Y\)` on `\(X\)` among treated and untreated units, respectively.

- Once we have estimates of `\(\beta^{D=1}\)` and `\(\beta^{D=0}\)`, we can plug them into the formulas for the ATE and ATT to get the plug-in estimators of these parameters as

`$$\begin{eqnarray*}
\widehat{ATE} &=& E_n[X]'~(\widehat{\beta}^{D=1} - \widehat{\beta}^{D=0}) , \\
\widehat{ATT} &=& E_n[X|D=1]'~(\widehat{\beta}^{D=1} - \widehat{\beta}^{D=0}),
\end{eqnarray*}$$`

where `\(E_n[X]\)` and `\(E_n[X|D=1]\)` are the sample averages of the `\((k+1)\)`-dimensional vector `\(X\)` among all units and among treated units, respectively.

---
# Let's illustrate these points using simulated data

- To run the simulations *in parallel*, we will use the following packages:

  - doRNG: to set the seed for reproducibility
  - doSNOW: to run the simulations in parallel
  - foreach: to run the simulations in parallel with a nice interface
  - estimatr: to estimate the ATE and ATT using the plug-in estimators

- A convenient way to install (if necessary) and load everything is to run the code chunk below.
```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(estimatr, tidyverse, foreach, doRNG, doSNOW, parallel, ggplot2)
```

---
# Let's illustrate these points using simulated data

```r
# Set the seed for reproducibility
set.seed(20240325)
# Number of observations
n_obs <- 1000
# Generate confounders
x1 <- rnorm(n_obs, mean = 0, sd = 1)
x2 <- rnorm(n_obs, mean = 0, sd = 2)
x3 <- rpois(n_obs, lambda = 1)
x4 <- rbinom(n_obs, size = 1, prob = 0.2)
# Put them all in a matrix
x <- cbind(1, x1, x2, x3, x4)
# Generate potential outcomes as linear functions of the confounders
beta_Y0 <- c(1, 2, 3, 4, 5)
beta_Y1 <- c(2, 3, 4, 5, 5)
epsilon_Y0 <- rnorm(n_obs, mean = 0, sd = 1)
epsilon_Y1 <- rnorm(n_obs, mean = 0, sd = 1)
y1 <- x %*% beta_Y1 + epsilon_Y1
y0 <- x %*% beta_Y0 + epsilon_Y0
```

---
# Let's illustrate these points using simulated data

```r
# True ATE is (beta_Y1 - beta_Y0) %*% population mean of X
true_ATE <- as.numeric((beta_Y1 - beta_Y0) %*% c(1, 0, 0, 1, 0.2))
# Generate the treatment indicator
beta_pscore <- c(1, 1, -1, 1, -1)/3
pscore <- plogis(x %*% beta_pscore)
d <- 1*(runif(n_obs) <= pscore)
# Generate the observed outcomes
y <- y1*d + y0*(1-d)
# Put everything in a data frame
df <- data.frame(y = y, d = d, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
```

---
# Let's get it going: Estimating E[Y|D=1, X]

```r
# First, let's estimate the parameters associated with the treatment group
reg_treated <- lm_robust(y ~ x1 + x2 + x3 + x4, data = df %>% filter(d==1))
reg_treated
```

```
##             Estimate Std. Error   t value      Pr(>|t|) CI Lower CI Upper  DF
## (Intercept) 1.976888 0.06547941  30.19098 4.697820e-124 1.848301 2.105475 623
## x1          3.010033 0.03930404  76.58329 3.442452e-319 2.932848 3.087217 623
## x2          4.001553 0.02209097 181.13975  0.000000e+00 3.958172 4.044935 623
## x3          5.038368 0.03980441 126.57813  0.000000e+00 4.960201 5.116535 623
## x4          4.959727 0.11145127  44.50131 1.280791e-195 4.740862 5.178593 623
```

```r
# Compare estimated coefficients with the true ones
# (we only know these because we generated the data)
beta_Y1
```

```
## [1] 2 3 4 5 5
```

---
# Let's get it going: Estimating E[Y|D=0, X]

```r
# Next, let's estimate the parameters associated with the comparison group
reg_untreated <- lm_robust(y ~ x1 + x2 + x3 + x4, data = df %>% filter(d==0))
reg_untreated
```

```
##              Estimate Std. Error   t value      Pr(>|t|)  CI Lower CI Upper  DF
## (Intercept) 0.9756212 0.07611523  12.81769  2.448938e-31 0.8259445 1.125298 367
## x1          2.0165204 0.05199279  38.78461 7.024609e-132 1.9142792 2.118762 367
## x2          3.0052323 0.02334023 128.75763 1.616802e-307 2.9593349 3.051130 367
## x3          3.9653569 0.05144444  77.08038 9.193670e-229 3.8641941 4.066520 367
## x4          5.0606895 0.11646295  43.45322 9.226814e-147 4.8316711 5.289708 367
```

```r
# Compare estimated coefficients with the true ones
# (we only know these because we generated the data)
beta_Y0
```

```
## [1] 1 2 3 4 5
```

---
# Let's get it going: Estimating ATE

```r
# Now, let's actually compute the ATE the plug-in way
# First, get the CATE estimates
y1_hat <- predict(reg_treated, newdata = df)
y0_hat <- predict(reg_untreated, newdata = df)
cate_hat <- y1_hat - y0_hat
# Now, compute the ATE
ate_hat <- mean(cate_hat)
ate_hat
```

```
## [1] 2.048136
```

```r
# Compare with the true, unfeasible ATE
true_ATE
```

```
## [1] 2
```

---
# Let's get it going: comparing

```r
# What if we had run a single regression?
reg_single <- lm_robust(y ~ d + x1 + x2 + x3 + x4, data = df)
reg_single
```

```
##                 Estimate Std. Error      t value      Pr(>|t|)   CI Lower  CI Upper  DF
## (Intercept) -0.007034925 0.10868474  -0.06472781  9.484037e-01 -0.2203128 0.2062429 994
## d            2.171144761 0.12105559  17.93510553  1.559495e-62  1.9335909 2.4086986 994
## x1           2.639992945 0.05387442  49.00271515 2.162323e-267  2.5342723 2.7457136 994
## x2           3.576158143 0.03325877 107.52526007  0.000000e+00  3.5108927 3.6414236 994
## x3           4.676836727 0.05873254  79.62939474  0.000000e+00  4.5615827 4.7920907 994
## x4           5.168677536 0.13981587  36.96774490 6.747058e-189  4.8943094 5.4430457 994
```

```r
# Notice that the coefficient on d is not statistically different from 2, though
(reg_single$coefficients["d"] - 2)/reg_single$std.error["d"]
```

```
##       d 
## 1.41377
```

---
# Let's get it going: comparing

```r
# What if we were to run a simple regression?
reg_simple <- lm_robust(y ~ d, data = df)
reg_simple
```

```
##               Estimate Std. Error   t value     Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept)  7.6095307  0.4232595 17.978406 8.048498e-63  6.778950 8.4401113 998
## d           -0.9474961  0.5907214 -1.603964 1.090383e-01 -2.106695 0.2117024 998
```

---
# What about the ATT?

- In the above example, the true ATE was easy to compute because it only depends on the `\(CATE(X)\)` and the average of the `\(X\)`'s,

`$$\begin{eqnarray*}
ATE = E[X]'(\beta^{D=1} - \beta^{D=0}).
\end{eqnarray*}$$`

- The true ATT, however, is not as straightforward to compute because it depends on the distribution of `\(X\)` among the treated units only.

- And we do not have that distribution explicitly in the DGP!

- Note that this difficulty concerns the .hi[TRUE] ATT only!

- To estimate the ATT, though, things are very simple and build on the previous steps, as we show next.

---
# Let's get it going: Estimating ATT

```r
# Now, let's actually compute the ATT the plug-in way
# It just averages the cate_hat over the treated units
att_hat <- mean(cate_hat[df$d==1])
att_hat
```

```
## [1] 1.761307
```

```r
# But how do I know if this is "close" to the true one?
```

---
# Using simulations to approximate the ATT

- We can use simulations to approximate the true ATT.

- We can do this by simulating the data many times and computing the ATT each time.

- We can then compare the average of these ATTs with the ATT we estimated using the plug-in method.

- This is a Monte Carlo simulation.

- For the purpose of learning about the true ATT, I recommend picking a large sample size.

- Let's do it!

---
# Let's get it going: Monte Carlo Simulation

```r
# Let's simulate the data many times and compute the ATT each time
n_sim <- 1000   # 1k simulations
n_obs <- 1e6    # 1M observations
ncores <- 20    # Number of cores to use
# Make cluster to run in parallel
cl <- parallel::makeCluster(ncores)
registerDoSNOW(cl)
# Set seed (guaranteed reproducibility)
seed1 <- 20240326
set.seed(seed1)
```

---
# Let's get it going: Monte Carlo Simulation

```r
sims_att <- foreach(nn = 1:n_sim) %dorng% {
  # Generate confounders
  x1 <- rnorm(n_obs, mean = 0, sd = 1)
  x2 <- rnorm(n_obs, mean = 0, sd = 2)
  x3 <- rpois(n_obs, lambda = 1)
  x4 <- rbinom(n_obs, size = 1, prob = 0.2)
  # Put them all in a matrix
  x <- cbind(1, x1, x2, x3, x4)
  # Generate potential outcomes as linear functions of the confounders
  beta_Y0 <- c(1, 2, 3, 4, 5)
  beta_Y1 <- c(2, 3, 4, 5, 5)
  epsilon_Y0 <- rnorm(n_obs, mean = 0, sd = 1)
  epsilon_Y1 <- rnorm(n_obs, mean = 0, sd = 1)
  y1 <- x %*% beta_Y1 + epsilon_Y1
  y0 <- x %*% beta_Y0 + epsilon_Y0
  # Generate the treatment indicator
  beta_pscore <- c(1, 1, -1, 1, -1)/3
  pscore <- plogis(x %*% beta_pscore)
  d <- 1*(runif(n_obs) <= pscore)
  # Compute ATT (for the sample)
  att <- mean(y1[d==1] - y0[d==1])
  return(att)
}
```

---
# Let's get it going: Monte Carlo Simulation

```r
#-----------------------------------------------------------------------------
# Stop the cluster
stopCluster(cl)
```

```r
# Compute average of the Monte Carlo sims
mean(unlist(sims_att))
```

```
## [1] 1.778139
```

```r
# Standard deviation of the Monte Carlo sims
sd(unlist(sims_att))
```

```
## [1] 0.003585003
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Set some parameters
n_sim <- 1000    # 1k simulations
n_obs <- 10000   # 10k observations
ncores <- 20     # Number of cores to use
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Create a function to simulate the data
simulate_data <- function(n_obs){
  # Generate confounders
  x1 <- rnorm(n_obs, mean = 0, sd = 1); x2 <- rnorm(n_obs, mean = 0, sd = 2)
  x3 <- rpois(n_obs, lambda = 1); x4 <- rbinom(n_obs, size = 1, prob = 0.2)
  # Put them all in a matrix
  x <- cbind(1, x1, x2, x3, x4)
  # Generate potential outcomes as linear functions of the confounders
  beta_Y0 <- c(1, 2, 3, 4, 5); beta_Y1 <- c(2, 3, 4, 5, 5)
  epsilon_Y0 <- rnorm(n_obs); epsilon_Y1 <- rnorm(n_obs)
  y1 <- x %*% beta_Y1 + epsilon_Y1
  y0 <- x %*% beta_Y0 + epsilon_Y0
  # Generate the treatment indicator
  beta_pscore <- c(1, 1, -1, 1, -1)/3
  pscore <- plogis(x %*% beta_pscore)
  d <- 1*(runif(n_obs) <= pscore)
  # Observed outcome
  y <- y1*d + y0*(1-d)
  df <- data.frame(y = y, d = d, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
  return(df)
}
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Create a function to compute the ATE and ATT
ate_att_estimate <- function(df1){
  reg_treated <- estimatr::lm_robust(y ~ x1 + x2 + x3 + x4, data = df1 %>% filter(d==1))
  reg_untreated <- estimatr::lm_robust(y ~ x1 + x2 + x3 + x4, data = df1 %>% filter(d==0))
  y1_hat <- predict(reg_treated, newdata = df1)
  y0_hat <- predict(reg_untreated, newdata = df1)
  cate_hat <- y1_hat - y0_hat
  ate_hat <- as.numeric(mean(cate_hat))
  att_hat <- as.numeric(mean(cate_hat[df1$d==1]))
  out <- cbind(ate_hat, att_hat)
  return(out)
}
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Make cluster to run in parallel
cl <- parallel::makeCluster(ncores)
registerDoSNOW(cl)
# Set seed (guaranteed reproducibility)
seed1 <- 20240326
set.seed(seed1)

sims_eval <- foreach(nn = 1:n_sim, .packages = c("estimatr", "tidyverse")) %dorng% {
  df <- simulate_data(n_obs)
  out <- ate_att_estimate(df)
  return(out)
}
#-----------------------------------------------------------------------------
# Stop the cluster
stopCluster(cl)
```

---
# Using Monte Carlo to evaluate ATE and ATT estimators

```r
# Put the Monte Carlo results in a matrix
sims_eval1 <- array(unlist(sims_eval),
                    dim = c(nrow(sims_eval[[1]]), ncol(sims_eval[[1]]), length(sims_eval)))
sims_eval1 <- t(matrix(sims_eval1, 2, n_sim)); colnames(sims_eval1) <- c("ate", "att")
# Compute average of the Monte Carlo sims
colMeans(sims_eval1)
```

```
##      ate      att 
## 1.999320 1.776301
```

```r
# Compute bias (using the Monte Carlo approximation 1.778 for the true ATT)
colMeans(sims_eval1) - c(2, 1.778)
```

```
##           ate           att 
## -0.0006804398 -0.0016986254
```

```r
# Root mean squared error
sqrt(colMeans((sims_eval1 - matrix(c(2, 1.778), nrow = n_sim, ncol = 2))^2))
```

```
##       ate       att 
## 0.1596595 0.1626643
```

---
# Some Plots

```r
# Theme of the plots
p.style <- theme(
  panel.background = element_rect(fill = "transparent"),              # bg of the panel
  plot.background = element_rect(fill = "transparent", color = NA),   # bg of the plot
  panel.grid.major = element_blank(),                                 # get rid of major grid
  panel.grid.minor = element_blank(),                                 # get rid of minor grid
  legend.background = element_rect(fill = "transparent"),             # get rid of legend bg
  legend.box.background = element_rect(fill = "transparent"),         # get rid of legend panel bg
  #legend.title = element_blank(),
  legend.title = element_text(size = 9),
  legend.text = element_text(size = 8),
  legend.key.size = unit(.5, "cm"))
```

---
# Some Plots

```r
# Plot ATE
p1 <- ggplot(data.frame(sims_eval1), aes(x = ate)) +
  geom_density(color = "black", fill = "#ed6464", alpha = .9) +
  geom_vline(xintercept = 2) +
  p.style +
  labs(x = "Regression Adjusted Estimator for ATE", y = "Density")
```

---
# ATE

```r
p1
```

<img src="13slides_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Some Plots

```r
# Plot ATT
p2 <- ggplot(data.frame(sims_eval1), aes(x = att)) +
  geom_density(color = "black", fill = "#fff7bc", alpha = .9) +
  geom_vline(xintercept = 1.778) +
  p.style +
  labs(x = "Regression Adjusted Estimator for the ATT", y = "Density")
```

---
# ATT

```r
p2
```

<img src="13slides_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
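---
# Wrapping up: a reusable Monte Carlo summary

- The bias and RMSE computations above can be wrapped in a small helper for future problem sets. A minimal sketch (the function `mc_summary` is illustrative and not part of the lecture code):

```r
# Summarize a vector of Monte Carlo estimates against a known truth
mc_summary <- function(estimates, truth){
  c(bias = mean(estimates) - truth,
    sd   = sd(estimates),
    rmse = sqrt(mean((estimates - truth)^2)))
}

# Example with artificial draws centered at a truth of 2
set.seed(42)
draws <- rnorm(1000, mean = 2, sd = 0.16)
round(mc_summary(draws, truth = 2), 3)
```

- For an (approximately) unbiased estimator, the RMSE is close to the Monte Carlo standard deviation, which is what we saw for both the ATE and ATT estimators above.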