Econ 520: Data Science for Economists

This problem set builds on the lectures on inference with regression adjustment and gives you an opportunity to practice the concepts and techniques covered in class. We will look at the problems caused by ignoring first-step estimation when making inference, and at how to address them via influence functions (or clever packages) and via the bootstrap.

1.(20 points) Throughout this problem set, we will use the same data generating process (DGP). Let \(Y(1), Y(0)\) be the potential outcomes under treatment and control, respectively. Let \(D\) be the treatment indicator and \(X\) be a set of covariates. The realized outcome is given by \(Y = D Y(1) + (1 - D) Y(0)\).

We will consider the following data generating process:
- \(X = (X_1, X_2)\), where each \(X_j\) follows a normal distribution with mean 0 and variance 1, and \(X_1\) and \(X_2\) are independent.
- The treatment indicator \(D\) is generated as \(D = 1\{u \leq p(X)\}\), where \(u\) is a random variable following a uniform distribution on \([0,1]\), and \(p(X) = \exp(-1 + 0.5X_1 - 0.5X_2)/(1+\exp(-1 + 0.5X_1 - 0.5X_2))\).

The treated and untreated potential outcomes are generated as \(Y(1) = 1 + X_1 + X_2 + \varepsilon(1)\) and \(Y(0) = -X_1 - X_2 + \varepsilon(0)\), respectively, where the \(\varepsilon\)’s follow a normal distribution with mean 0 and variance 1 and are independent of each other.
The observed data are \((Y_i, D_i, X_{1,i}, X_{2,i})_{i=1}^n\), where \(Y_i = D_i Y_i(1) + (1-D_i) Y_i(0)\).

Write a function in R (or other software) that generates data according to the data generating process above. The function should take the sample size as its input and return a data frame with the generated data.
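As a starting point, the DGP could be simulated along the following lines. This is only a sketch: the function name `generate_data` and the column names `y`, `d`, `x1`, `x2` are assumptions made here for illustration, and \(u\) is taken to be uniform on \([0,1]\).

```r
# Hypothetical helper: draw n observations from the DGP described above.
# Function and column names are illustrative choices, not prescribed by the problem set.
generate_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  # Logistic propensity score: plogis(z) = exp(z) / (1 + exp(z))
  ps <- plogis(-1 + 0.5 * x1 - 0.5 * x2)
  # Treatment indicator D = 1{u <= p(X)}, with u assumed Uniform(0, 1)
  d <- as.numeric(runif(n) <= ps)
  # Potential outcomes with independent standard normal errors
  y1 <- 1 + x1 + x2 + rnorm(n)
  y0 <- -x1 - x2 + rnorm(n)
  # Only the realized outcome is observed
  y <- d * y1 + (1 - d) * y0
  data.frame(y = y, d = d, x1 = x1, x2 = x2)
}

set.seed(520)
df <- generate_data(1000)
head(df)
```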

2.(20 points) Suppose you are interested in estimating the ATE using a linear regression model. You decide to use the following linear regression model to estimate the ATE: \[ Y_i = \alpha + \beta D_i + \gamma_1 \tilde{X}_{1,i} + \gamma_2 \tilde{X}_{2,i} + \gamma_3 D_i \tilde{X}_{1,i} + \gamma_4 D_i \tilde{X}_{2,i} +\varepsilon_i, \] where, for \(j \in \{1,2\}\), \(\tilde{X}_{j,i} = X_{j,i} - \overline{X}_{j}\), with \(\overline{X}_{j}\) being the sample mean of \(X_j\).

Write a function that takes as input a data frame with the structure of the data frame you generated in question 1. It should then estimate all the coefficients of the linear regression model above and return the estimated \(\beta\), its standard error, and its 95% confidence interval. Note that this function returns four values: the estimated \(\beta\), its standard error, and the lower and upper bounds of the 95% confidence interval.
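One possible shape for such a function is sketched below. The name `ate_interacted` and the column names `y`, `d`, `x1`, `x2` are assumptions for illustration. The standard error is the conventional one reported by `lm`, which treats the demeaned covariates as fixed, and the interval uses the normal critical value 1.96; other choices are possible.

```r
# Hypothetical helper: interacted regression with demeaned covariates.
# Assumes columns y, d, x1, x2 (illustrative names).
ate_interacted <- function(df) {
  # Demean covariates with their sample means
  df$x1c <- df$x1 - mean(df$x1)
  df$x2c <- df$x2 - mean(df$x2)
  fit <- lm(y ~ d + x1c + x2c + d:x1c + d:x2c, data = df)
  tab <- coef(summary(fit))
  est <- tab["d", "Estimate"]
  se <- tab["d", "Std. Error"]  # conventional lm standard error
  c(estimate = est, se = se, lower = est - 1.96 * se, upper = est + 1.96 * se)
}

# Small sample simulated from the DGP in question 1, for illustration only
set.seed(520)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
d <- as.numeric(runif(n) <= plogis(-1 + 0.5 * x1 - 0.5 * x2))
y <- d * (1 + x1 + x2 + rnorm(n)) + (1 - d) * (-x1 - x2 + rnorm(n))
res_reg <- ate_interacted(data.frame(y, d, x1, x2))
res_reg
```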

3.(20 points) Suppose you are interested in estimating the ATE using a linear regression model. You decide to rely on the targeted package in R. Write a function that takes as input a data frame with the structure of the data frame you generated in question 1. It should then estimate the ATE using the targeted package and return the estimated ATE, its standard error, and its 95% confidence interval. Note that this function returns four values: the estimated ATE, its standard error, and the lower and upper bounds of the 95% confidence interval.

4.(30 points) Suppose you are interested in estimating the ATE using a linear regression model. You are more old school: you want to do things by hand and use the bootstrap to make inference. Write a function that takes as input a data frame with the structure of the data frame you generated in question 1, as well as the number of bootstrap repetitions you want to do. Within this function, you should run the following regressions: \[\begin{eqnarray*} Y_i &=& \alpha_{D=1} + X_{1,i} \beta_{1, D=1} + X_{2,i} \beta_{2, D=1} + \varepsilon_i, \text{ for units } D_i = 1,\\ Y_i &=& \alpha_{D=0} + X_{1,i} \beta_{1, D=0} + X_{2,i} \beta_{2, D=0} + \varepsilon_i, \text{ for units } D_i = 0, \end{eqnarray*}\] where \(\alpha_{D=1}\) and \(\alpha_{D=0}\) are the intercepts, and \(\beta_{j, D=1}\) and \(\beta_{j, D=0}\), for \(j \in \{1,2\}\), are the coefficients on the covariates. Based on these estimated coefficients, you should estimate the ATE as \[\begin{eqnarray*} \widehat{ATE}_{ra} &=& \dfrac{1}{n}\sum_{i=1}^n \bigg( \widehat{Y}_i(1) - \widehat{Y}_i(0)\bigg), \end{eqnarray*}\] where \(\widehat{Y}_i(1) = \widehat{\alpha}_{D=1} + X_{1,i} \widehat{\beta}_{1, D=1} + X_{2,i} \widehat{\beta}_{2, D=1}\) and \(\widehat{Y}_i(0) = \widehat{\alpha}_{D=0} + X_{1,i} \widehat{\beta}_{1, D=0} + X_{2,i} \widehat{\beta}_{2, D=0}\).

You should also use the boot package to estimate the bootstrap standard error of your regression-adjusted ATE. Based on the bootstrap standard error, you should also estimate the 95% confidence interval of the ATE as \(\widehat{ATE}_{ra} \pm 1.96 \times SE(\widehat{ATE}_{ra})\).

Finally, the function should return the estimated ATE, its standard error, and its 95% confidence interval. Notice that this function returns four values: the estimated ATE, its standard error, and the lower and upper bounds of the 95% confidence interval.
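The regression-adjustment and bootstrap steps above might be organized as in the sketch below. The names `ra_stat` and `ate_ra_boot` and the column names are assumptions for illustration. The statistic passed to `boot` re-fits both regressions on every resample, so the bootstrap standard error reflects the first-step estimation.

```r
library(boot)

# Regression-adjusted ATE computed on the rows of `data` selected by `idx`.
# Hypothetical helper; assumes columns y, d, x1, x2.
ra_stat <- function(data, idx) {
  d_b <- data[idx, ]
  fit1 <- lm(y ~ x1 + x2, data = d_b[d_b$d == 1, ])  # treated units
  fit0 <- lm(y ~ x1 + x2, data = d_b[d_b$d == 0, ])  # untreated units
  # Predict both potential outcomes for every unit and average the difference
  mean(predict(fit1, newdata = d_b) - predict(fit0, newdata = d_b))
}

# Hypothetical wrapper: point estimate, bootstrap SE, and normal-based 95% CI
ate_ra_boot <- function(df, n_boot) {
  b <- boot(data = df, statistic = ra_stat, R = n_boot)
  est <- b$t0          # estimate on the original sample
  se <- sd(b$t[, 1])   # standard deviation of the bootstrap draws
  c(estimate = est, se = se, lower = est - 1.96 * se, upper = est + 1.96 * se)
}

# Small sample simulated from the DGP in question 1, for illustration only
set.seed(520)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
d <- as.numeric(runif(n) <= plogis(-1 + 0.5 * x1 - 0.5 * x2))
y <- d * (1 + x1 + x2 + rnorm(n)) + (1 - d) * (-x1 - x2 + rnorm(n))
res_boot <- ate_ra_boot(data.frame(y, d, x1, x2), n_boot = 200)
res_boot
```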

5.(30 points) Run a Monte Carlo simulation with 1,000 simulations, sample size 1,000, and 1,000 bootstrap draws. For each simulation, you need to generate the data using the function in part 1 and then compute the three alternative estimates of the ATE using the functions in parts 2, 3, and 4. Then, you should compute the length of the confidence interval for each estimator, i.e., upper bound - lower bound. You should also check whether the true ATE, 1, is inside the confidence interval and create a variable for each estimator (“coverage indicator”) that is equal to one if the true ATE is inside its confidence interval, and zero otherwise.
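The value 1 for the true ATE follows directly from the DGP in question 1, because the covariates and the errors all have mean zero: \[\begin{eqnarray*} ATE &=& E[Y(1) - Y(0)] \\ &=& E[1 + 2X_1 + 2X_2 + \varepsilon(1) - \varepsilon(0)] \\ &=& 1 + 2E[X_1] + 2E[X_2] + E[\varepsilon(1)] - E[\varepsilon(0)] \\ &=& 1. \end{eqnarray*}\]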

For each simulation, you should store the estimated ATE, its standard error, the length of its confidence interval, and the coverage indicator for each estimator; that is, 12 different numbers for each simulation.

The output of the Monte Carlo simulations should be a matrix with 1,000 rows and 12 columns, where each row corresponds to a simulation and each column corresponds to one statistic of one estimator. You should fix the seed of the random number generator so the Monte Carlo results are reproducible.
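The bookkeeping for the simulation could look like the sketch below. The runner `run_mc`, its arguments, and the column-naming scheme are hypothetical. It accepts any named list of estimator functions that return the four values (estimate, standard error, lower bound, upper bound), so passing three estimators yields the 12 columns described above. The demo uses a single regression estimator and few replications only to keep the example short.

```r
# Hypothetical runner: `gen` simulates one sample of size n; `estimators` is a
# named list of functions returning c(estimate, se, lower, upper).
run_mc <- function(n_sims, n, gen, estimators, true_ate = 1, seed = 520) {
  set.seed(seed)  # fix the seed so results are reproducible
  stats <- c("estimate", "se", "ci_length", "covered")
  col_names <- as.vector(outer(stats, names(estimators), paste, sep = "_"))
  out <- matrix(NA_real_, nrow = n_sims, ncol = length(col_names),
                dimnames = list(NULL, col_names))
  for (s in seq_len(n_sims)) {
    df <- gen(n)
    out[s, ] <- unlist(lapply(estimators, function(f) {
      r <- unname(f(df))
      covered <- as.numeric(r[3] <= true_ate && true_ate <= r[4])
      c(r[1], r[2], r[4] - r[3], covered)
    }))
  }
  out
}

# Illustrative inputs: the DGP from question 1 and one regression estimator
gen <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  d <- as.numeric(runif(n) <= plogis(-1 + 0.5 * x1 - 0.5 * x2))
  y <- d * (1 + x1 + x2 + rnorm(n)) + (1 - d) * (-x1 - x2 + rnorm(n))
  data.frame(y, d, x1, x2)
}
reg <- function(df) {
  df$x1c <- df$x1 - mean(df$x1)
  df$x2c <- df$x2 - mean(df$x2)
  tab <- coef(summary(lm(y ~ d + x1c + x2c + d:x1c + d:x2c, data = df)))
  est <- tab["d", "Estimate"]
  se <- tab["d", "Std. Error"]
  c(est, se, est - 1.96 * se, est + 1.96 * se)
}

mc <- run_mc(n_sims = 20, n = 500, gen = gen, estimators = list(reg = reg))
dim(mc)
colnames(mc)
```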

6.(40 points) Based on the Monte Carlo simulation, you should answer the following questions:

(10 points) Compute the bias and root mean squared error (RMSE) of each estimator across the Monte Carlo simulations. Discuss your results. Are all estimators unbiased? Are the RMSEs of these estimators very different? Is this expected?
(10 points) Compute the coverage rate of the 95% confidence interval for each estimator across the Monte Carlo simulations. Discuss your results. Which estimator has the best coverage rate?
(10 points) Compute the average length of the 95% confidence interval for each estimator across the Monte Carlo simulations. Discuss your results. Which estimator has the smallest average length, i.e., is the most precise?
(10 points) Compute the average of the standard errors of the ATE estimates across the Monte Carlo simulations. Discuss your results. Which estimator has the smallest average standard error? Is this conclusion similar to the one about the average length of the confidence interval? Why or why not?
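The summary statistics requested above reduce to a few column means. The sketch below uses a hypothetical helper `summarize_estimator` that takes the four stored columns for one estimator. The toy inputs are made-up numbers used only to show the call; they are not simulation results.

```r
# Hypothetical helper: Monte Carlo summaries for one estimator.
# est, se, ci_length, covered are the stored columns for that estimator.
summarize_estimator <- function(est, se, ci_length, covered, true_ate = 1) {
  c(bias = mean(est) - true_ate,                 # average estimate minus truth
    rmse = sqrt(mean((est - true_ate)^2)),       # root mean squared error
    coverage = mean(covered),                    # share of CIs containing the truth
    avg_ci_length = mean(ci_length),
    avg_se = mean(se))
}

# Made-up toy inputs (NOT simulation results), only to illustrate the call
toy <- summarize_estimator(
  est = c(0.9, 1.1, 1.0, 1.2),
  se = rep(0.1, 4),
  ci_length = rep(2 * 1.96 * 0.1, 4),
  covered = c(1, 1, 1, 0)
)
toy
```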

Problem Set 5 - Inference with Regression Adjustment

Pedro Sant’Anna