Econ 520: Data Science for Economists

1. (100 points) Suppose your data comes from an experiment where the treatment is randomly assigned. Let \(Y(1), Y(0)\) be the potential outcomes under treatment and control, respectively. Let \(D\) be the treatment indicator, and \(X\) be a set of covariates. Since we have a completely randomized experiment, it follows that \((Y(1), Y(0)) \perp D.\)

Here, we have a few options to estimate the average treatment effect (ATE). One of the most common methods is the difference-in-means estimator, which is given by \[\begin{eqnarray*} \widehat{ATE}_{dm} &=& \frac{\sum_{i=1}^N D_i Y_i}{\sum_{i=1}^N D_i} - \frac{\sum_{i=1}^N (1-D_i) Y_i}{\sum_{i=1}^N (1-D_i)}\\ &=& \overline{Y}_{D=1} - \overline{Y}_{D=0}, \end{eqnarray*}\] where \(\overline{Y}_{D=1}\) and \(\overline{Y}_{D=0}\) are the sample means of the outcome variable for the treated and control groups, respectively.
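The formula above can be sketched in a few lines of R. This is only a minimal illustration: the function name `ate_dm` and the argument names `y` and `d` are assumptions made for this example, and the numbers are made up.

```r
# Difference-in-means: mean outcome of the treated minus mean outcome of the controls.
# `y` is the outcome vector and `d` is a 0/1 treatment indicator (hypothetical names).
ate_dm <- function(y, d) {
  mean(y[d == 1]) - mean(y[d == 0])
}

# Toy check with made-up numbers (not data from this problem set).
y <- c(3, 5, 1, 2, 4)
d <- c(1, 1, 0, 0, 0)
ate_dm(y, d)

# The same quantity written exactly as in the first line of the formula.
sum(d * y) / sum(d) - sum((1 - d) * y) / sum(1 - d)
```

Both expressions return the same number, since each ratio is just a group-specific sample mean.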

Another common method uses linear regression, as you have seen in Econ 320. That is, one can use the following regression model: \[\begin{eqnarray*} Y_i &=& \alpha + \theta D_i + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \varepsilon_i, \end{eqnarray*}\] where \(\alpha\) is the intercept, \(\theta\) is the coefficient on the treatment indicator, and \(\beta_1, \dots, \beta_k\) are the coefficients of the covariates. Researchers often interpret estimates of \(\theta\) as estimates of the ATE.
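As a rough sketch of this approach, the coefficient on the treatment indicator can be extracted from a fitted `lm` object. The function name `ate_ols` and the column names `y`, `d`, and `x` are assumptions made for this example, and the toy data below are simulated only to show the call.

```r
# Regression estimator: OLS coefficient on d in a regression of y on d and x.
# Assumes a data frame with columns y, d (0/1), and x (hypothetical names).
ate_ols <- function(data) {
  fit <- lm(y ~ d + x, data = data)
  unname(coef(fit)["d"])
}

# Toy data for illustration only (not the DGP of this problem set).
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n), d = rbinom(n, 1, 0.5))
toy$y <- 1 + 2 * toy$d + 0.5 * toy$x + rnorm(n)
ate_ols(toy)
```

With several covariates, the formula would list each of them on the right-hand side in the same way.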

A third option is to use regression adjustments, as discussed in Slide 13. That is, one can split the data into treated units and control units, and then regress the outcome variable on the covariates for each group. The average treatment effect can then be estimated as the difference between the average predicted outcomes for the treated and control groups. More precisely, if one uses linear regressions as working models, one can estimate the ATE by running the following regressions: \[\begin{eqnarray*} Y_i &=& \alpha_{D=1} + X_i' \beta_{D=1} + \varepsilon_i, \text{ for units } D_i = 1,\\ Y_i &=& \alpha_{D=0} + X_i' \beta_{D=0} + \varepsilon_i, \text{ for units } D_i = 0, \end{eqnarray*}\] where \(\alpha_{D=1}\) and \(\alpha_{D=0}\) are the intercepts, \(\beta_{D=1}\) and \(\beta_{D=0}\) are the coefficients of the covariates, and \(X_i'\) is the vector of covariates for unit \(i\).

Once these regressions are estimated, the ATE can be estimated as \[\begin{eqnarray*} \widehat{ATE}_{ra} &=& \widehat{\alpha}_{D=1} - \widehat{\alpha}_{D=0} + \overline{X}' \widehat{\beta}_{D=1} - \overline{X}' \widehat{\beta}_{D=0}, \end{eqnarray*}\] where \(\overline{X}\) is the vector of sample means of the covariates across all units (treated and control).
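One possible way to compute this estimator in R is to fit the two group-specific regressions and average their predictions over the full sample. With linear working models, the average prediction equals the intercept plus the covariate means times the slopes, so this matches the formula above. The function name `ate_ra` and the column names `y`, `d`, and `x` are assumptions made for this sketch.

```r
# Regression adjustment with linear working models.
# Assumes a data frame with columns y, d (0/1), and x (hypothetical names).
ate_ra <- function(data) {
  fit1 <- lm(y ~ x, data = data[data$d == 1, ])  # treated units only
  fit0 <- lm(y ~ x, data = data[data$d == 0, ])  # control units only
  # Predict both potential outcomes for ALL units, then average the difference.
  mean(predict(fit1, newdata = data)) - mean(predict(fit0, newdata = data))
}

# Small made-up example in which each group lies exactly on a line:
# treated: y = 1 + x, control: y = -x.
toy <- data.frame(x = c(-2, -1, 0, 1, 2, 3), d = c(1, 0, 1, 0, 1, 0))
toy$y <- ifelse(toy$d == 1, 1 + toy$x, -toy$x)
ate_ra(toy)
```

Predicting on the full sample, rather than on each subsample separately, is what makes the covariate means refer to all units.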

In this question, you will compare the performance of these three methods in estimating the ATE using Monte Carlo simulations.

We will consider the following data generating process (DGP):

- \(X\) is univariate, and follows a normal distribution with mean 0 and variance 1.
- The treatment indicator \(D\) is independent of everything, and is generated as \(1\{u < 0.2\}\), where \(u\) is a random variable following a uniform distribution on \([0,1]\).
- The treated and untreated potential outcomes are generated as \(Y_i(1) = X_i + \varepsilon_i(1)\) and \(Y_i(0) = -X_i + \varepsilon_i(0)\), respectively, where \(\varepsilon_i(1)\) and \(\varepsilon_i(0)\) follow a normal distribution with mean 0 and variance 1, and are independent of each other.
- Observed data is \((Y_i, D_i, X_i)_{i=1}^n\), where \(Y_i = D_i Y_i(1) + (1-D_i) Y_i(0)\).
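The DGP described above could be simulated along the following lines. This is a sketch rather than a reference solution: the function name `generate_data` and the column names `y`, `d`, and `x` are choices made for this example.

```r
# Simulate one sample of size n from the DGP described above.
# Function and column names are hypothetical choices for this sketch.
generate_data <- function(n) {
  x  <- rnorm(n)                    # X ~ N(0, 1)
  d  <- as.integer(runif(n) < 0.2)  # D = 1{u < 0.2}, u ~ Uniform(0, 1)
  y1 <- x + rnorm(n)                # Y(1) =  X + eps(1)
  y0 <- -x + rnorm(n)               # Y(0) = -X + eps(0)
  y  <- d * y1 + (1 - d) * y0       # observed outcome
  data.frame(y = y, d = d, x = x)
}

set.seed(520)  # arbitrary seed, used only to make this example reproducible
df <- generate_data(1000)
head(df)
mean(df$d)     # share of treated units, which should be near 0.2
```

Only the observed variables are returned, since the potential outcomes are not both observed in real data.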

Given the above DGP, we know that \(ATE = 2E[X] = 0\). We would like to estimate the ATE as precisely as possible.
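To see where this value comes from, substitute the potential outcome equations into the definition of the ATE and use the fact that \(X\) and the error terms all have mean zero: \[\begin{eqnarray*} ATE &=& E[Y(1) - Y(0)]\\ &=& E[X + \varepsilon(1)] - E[-X + \varepsilon(0)]\\ &=& 2E[X] + E[\varepsilon(1)] - E[\varepsilon(0)]\\ &=& 0. \end{eqnarray*}\] Note that the conditional effect \(E[Y(1) - Y(0) | X] = 2X\) varies with \(X\), even though it averages out to zero.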

(15 points) Write a function in R that generates data according to the above data generating process. The function should take the sample size as input and return a data frame with the generated data.
(10 points) Write a function in R that estimates the ATE using the difference-in-means estimator. The function should take as input the data frame generated by the function in the previous question and return the estimated ATE.
(10 points) Write a function in R that estimates the ATE using the linear regression estimator. The function should take as input the data frame generated by the function in the first question and return the estimated \(\theta\) (which will be interpreted as an estimate of the ATE).
(10 points) Write a function in R that estimates the ATE using the regression adjustment estimator. The function should take as input the data frame generated by the function in the first question and return the estimated ATE.
(10 points) Run a Monte Carlo simulation with 1,000 simulations and a sample size of 1,000. For each simulation, you need to generate the data using the above DGP, and compute the three alternative estimates of the ATE. The output should be a matrix with 1,000 rows and 3 columns, where each row corresponds to a simulation, and each column corresponds to an estimator. You should fix the seed of the random number generator so the Monte Carlo results are reproducible.
(10 points) Compute the bias of each estimator across the Monte Carlo. Discuss your results. Are all estimators unbiased? Which is the most biased?
(10 points) Compute the variance of each estimator across the Monte Carlo. Discuss your results. Which estimator has the smallest variance?
(10 points) Compute the root mean squared error of each estimator across the Monte Carlo. Discuss your results. Which estimator has the smallest root mean squared error?
(15 points) Repeat the Monte Carlo simulation but now with sample size 100,000. Discuss your results. How does the sample size affect the performance of the estimators?
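The Monte Carlo steps above (loop over simulations, store the estimates in a matrix, then summarize) can be organized roughly as follows. This is a generic sketch: `run_mc` and `summarize_mc` are hypothetical helper names, the seed value is arbitrary, and the toy generator and estimators (the mean and median of a standard normal sample) are placeholders rather than the DGP and estimators of this problem set.

```r
# Generic Monte Carlo driver (hypothetical helper).
# `generate` maps a sample size to a data frame; `estimators` is a named list
# of functions that each map a data frame to a single number.
run_mc <- function(n_sims, n, generate, estimators, seed = 520) {
  set.seed(seed)  # fix the seed so the results are reproducible
  out <- matrix(NA_real_, nrow = n_sims, ncol = length(estimators),
                dimnames = list(NULL, names(estimators)))
  for (s in seq_len(n_sims)) {
    sim_data <- generate(n)
    out[s, ] <- vapply(estimators, function(f) f(sim_data), numeric(1))
  }
  out
}

# Bias, variance, and RMSE of each column, relative to the true value.
summarize_mc <- function(estimates, truth = 0) {
  bias     <- colMeans(estimates) - truth
  variance <- apply(estimates, 2, var)
  rmse     <- sqrt(colMeans((estimates - truth)^2))
  rbind(bias = bias, variance = variance, rmse = rmse)
}

# Placeholder example: estimating the mean of a standard normal (true value 0).
toy_generate <- function(n) data.frame(y = rnorm(n))
toy_estimators <- list(
  mean   = function(df) mean(df$y),
  median = function(df) median(df$y)
)
results <- run_mc(n_sims = 200, n = 50, toy_generate, toy_estimators)
dim(results)  # one row per simulation, one column per estimator
summarize_mc(results, truth = 0)
```

Because `var` divides by the number of simulations minus one, the squared RMSE equals the squared bias plus the variance only approximately; the gap vanishes as the number of simulations grows.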

2. (50 points) Suppose now that selection into treatment is not random, i.e., we do not have an A/B test like in Question 1. More specifically, let’s assume that the treatment indicator \(D\) depends on the covariates \(X\), i.e., \(D = 1\{u \leq p(X)\}\), where \(u\) is a random variable following a uniform distribution on \([0,1]\), and \(p(X) = \exp(-1 + 0.5X)/(1+\exp(-1 + 0.5X))\).
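The assignment rule above is a logistic function of \(X\), which in R can be evaluated with `plogis`. The following sketch shows one way to draw the treatment indicator; the function name `assign_treatment` is an assumption made for this example.

```r
# Draw D = 1{u <= p(X)} with p(X) = exp(-1 + 0.5 X) / (1 + exp(-1 + 0.5 X)).
# `assign_treatment` is a hypothetical helper name.
assign_treatment <- function(x) {
  p <- plogis(-1 + 0.5 * x)          # logistic function of the index
  as.integer(runif(length(x)) <= p)  # u ~ Uniform(0, 1), drawn independently of x
}

# plogis agrees with the explicit formula:
all.equal(plogis(-1), exp(-1) / (1 + exp(-1)))

set.seed(520)  # arbitrary seed, used only to make this example reproducible
x <- rnorm(5000)
d <- assign_treatment(x)
# Units with larger X are treated more often, so D is no longer independent of X.
c(share_treated_low_x = mean(d[x < 0]), share_treated_high_x = mean(d[x > 0]))
```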

Repeat the Monte Carlo simulation in Question 1, but now considering the new treatment assignment. Discuss your results. How does the performance of the estimators change when selection into treatment is not random?

Problem Set 4 - Regression Adjustment and Simulations

Pedro Sant’Anna