Lecture 5: Introduction to Difference-in-Differences
Emory University
Spring 2026
Lectures 3–4: Causal inference with randomized panel experiments
Today: What if treatment is not randomized?
Our approach:
The full journey today:
Baker et al. (2025): A Practitioner’s Guide to Difference-in-Differences
Policy: Affordable Care Act (ACA) Medicaid expansion
Outcome: County-level mortality rate (ages 20–64), deaths per 100,000
Question: Did Medicaid expansion reduce mortality?
\(2 \times 2\) Setup:
Time periods: \(t \in \{1, 2\}\) (e.g., 2013 and 2014)
Treatment group indicator: \(G_i \in \{2, \infty\}\)
Treatment indicator: \(D_{i,t} = \mathbf{1}\{G_i \leq t\}\)
Key features:
Recall from Lecture 2: potential outcomes indexed by treatment sequence
Specialize to \(2 \times 2\): Only two possible treatment paths
SUTVA (Stable Unit Treatment Value Assumption):
Observed outcome: \[ Y_{i,t} = \mathbf{1}\{G_i = 2\} \cdot Y_{i,t}(2) + \mathbf{1}\{G_i = \infty\} \cdot Y_{i,t}(\infty) \]
This is the same potential outcomes framework from Lectures 2–4, just specialized to two groups and two periods
Parameter of interest: The ATT at period 2 \[ ATT \;=\; \mathbb{E}\big[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \;\big|\; G_i = 2\big] \]
Interpretation:
In the Medicaid example:
The fundamental problem: We observe \(Y_{i,t=2}(2)\) for the treated group, but we never observe \(Y_{i,t=2}(\infty)\) for the treated group
How do we impute the missing counterfactual?
\[ \theta^{DiD} = \Big(\mathbb{E}[Y_{i,t=2} | G_i = 2] - \mathbb{E}[Y_{i,t=1} | G_i = 2]\Big) - \Big(\mathbb{E}[Y_{i,t=2} | G_i = \infty] - \mathbb{E}[Y_{i,t=1} | G_i = \infty]\Big) \]
Two fundamental questions for the rest of this lecture:
| | \(Y_{i,t=1}(\infty)\) | \(Y_{i,t=1}(2)\) | \(Y_{i,t=2}(\infty)\) | \(Y_{i,t=2}(2)\) |
|---|---|---|---|---|
| \(G_i = 2\) (Treated) | ? | \(\checkmark\) | ? | \(\checkmark\) |
| \(G_i = \infty\) (Comparison) | \(\checkmark\) | ? | \(\checkmark\) | ? |
| | \(Y_{i,t=1}(\infty)\) | \(Y_{i,t=1}(2)\) | \(Y_{i,t=2}(\infty)\) | \(Y_{i,t=2}(2)\) |
|---|---|---|---|---|
| \(G_i = 2\) (Treated) | \(\checkmark\)* | \(\checkmark\) | ? | \(\checkmark\) |
| \(G_i = \infty\) (Comparison) | \(\checkmark\) | — | \(\checkmark\) | — |
*Same as \(Y_{i,t=1}(2)\) under No-Anticipation
\[ \begin{aligned} &\mathbb{E}[Y_{i,t=2} | G_i = 2] - \mathbb{E}[Y_{i,t=2} | G_i = \infty] \\[6pt] &= \mathbb{E}[Y_{i,t=2}(2) | G_i = 2] - \mathbb{E}[Y_{i,t=2}(\infty) | G_i = \infty] \\[6pt] &= \underbrace{\mathbb{E}[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) | G_i = 2]}_{ATT} + \underbrace{\mathbb{E}[Y_{i,t=2}(\infty) | G_i = 2] - \mathbb{E}[Y_{i,t=2}(\infty) | G_i = \infty]}_{\color{#b91c1c}{\text{Selection Bias}}} \end{aligned} \]
| Unit | \(Y_{i,t=1}(\infty)\) | \(Y_{i,t=1}(2)\) | \(Y_{i,t=2}(\infty)\) | \(Y_{i,t=2}(2)\) |
|---|---|---|---|---|
| A (Trained) | 20 | 20 | 22 | 27 |
| B (Trained) | 18 | 18 | 20 | 24 |
| C (Not trained) | 30 | 30 | 32 | — |
| D (Not trained) | 28 | 28 | 30 | — |
Selection bias contaminates naive comparisons.
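To see the numbers, here is a minimal sketch in Python (values copied from the wage table above; variable names are mine):

```python
# Toy job-training example from the table above.
treated_pre = [20, 18]    # Y_{i,t=1} for A, B
treated_post = [27, 24]   # Y_{i,t=2}(2) for A, B
control_pre = [30, 28]    # Y_{i,t=1} for C, D
control_post = [32, 30]   # Y_{i,t=2}(infinity) for C, D

mean = lambda x: sum(x) / len(x)

# Naive post-period comparison: contaminated by selection bias
naive = mean(treated_post) - mean(control_post)
print(f"Naive post comparison: {naive:+.1f}")   # -5.5, wrong sign!

# DiD: difference in within-group changes
did = (mean(treated_post) - mean(treated_pre)) \
    - (mean(control_post) - mean(control_pre))
print(f"DiD estimate:          {did:+.1f}")     # +4.5, equals the ATT here
```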
What assumptions let us recover the ATT?
Our first assumption ensures that future treatment does not contaminate pre-treatment outcomes:
No-Anticipation. For all units \(i\): \(Y_{i,t=1}(2) = Y_{i,t=1}(\infty)\). Treatment at \(t=2\) has no effect on outcomes at \(t=1\).
For ATT identification, we only need this on average across treated units: \(\mathbb{E}[Y_{i,t=1}(2) | G_i = 2] = \mathbb{E}[Y_{i,t=1}(\infty) | G_i = 2]\)
Parallel Trends (PT). \[ \mathbb{E}[Y_{i,t=2}(\infty) - Y_{i,t=1}(\infty) \;|\; G_i = 2] = \mathbb{E}[Y_{i,t=2}(\infty) - Y_{i,t=1}(\infty) \;|\; G_i = \infty]. \]
This PT involves counterfactual outcomes \(\Rightarrow\) fundamentally untestable. We can assess plausibility using pre-treatment data (in a few lectures).
PT is fundamentally untestable — it concerns counterfactual trends, not observed trends
Suggestive evidence: Pre-treatment trends can provide support
Potential violations in the Medicaid example:
Best practice: Argue for PT using institutional knowledge, not just pre-trend tests. We will revisit this in future lectures
Goal: Show that \(\theta^{DiD} = ATT\) under SUTVA + No-Anticipation + PT.
Start from the ATT:
\[ ATT = \underbrace{\mathbb{E}[Y_{i,t=2}(2) | G_i = 2]}_{\color{#15803d}{\text{observable}}} - \underbrace{\mathbb{E}[Y_{i,t=2}(\infty) | G_i = 2]}_{\color{#b91c1c}{\text{counterfactual}}} \]
Use PT to impute the counterfactual:
\[ \mathbb{E}[Y_{i,t=2}(\infty) | G_i = 2] = \mathbb{E}[Y_{i,t=1}(\infty) | G_i = 2] + \mathbb{E}[Y_{i,t=2}(\infty) | G_i = \infty] - \mathbb{E}[Y_{i,t=1}(\infty) | G_i = \infty] \]
Under No-Anticipation + SUTVA, all terms are observable:
\[ \mathbb{E}[Y_{i,t=2}(\infty) | G_i = 2] = \underbrace{\mathbb{E}[Y_{i,t=1} | G_i = 2]}_{\color{#15803d}{\checkmark\; \text{No-Anticipation}}} + \underbrace{\mathbb{E}[Y_{i,t=2} | G_i = \infty] - \mathbb{E}[Y_{i,t=1} | G_i = \infty]}_{\color{#15803d}{\checkmark\; \text{all observable}}} \]
Theorem (DiD Identification). Under SUTVA, No-Anticipation, and Parallel Trends: \[ ATT = \theta^{DiD} = \Big(\mathbb{E}[Y_{i,t=2} | G_i = 2] - \mathbb{E}[Y_{i,t=1} | G_i = 2]\Big) - \Big(\mathbb{E}[Y_{i,t=2} | G_i = \infty] - \mathbb{E}[Y_{i,t=1} | G_i = \infty]\Big) \]
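The theorem is easy to illustrate by simulation. Below is a sketch (mine, not from Baker et al.): untreated potential outcomes satisfy PT by construction, No-Anticipation holds, and the DiD contrast recovers the ATT of 3 while the naive post-period comparison does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Treated group indicator (G_i = 2 vs. infinity)
G2 = rng.random(n) < 0.4

# Untreated potential outcomes: group-specific LEVELS (treated start lower)
# but a COMMON trend, so Parallel Trends holds by construction.
alpha = np.where(G2, 10.0, 14.0) + rng.normal(0, 2, n)  # unit heterogeneity
trend = 1.5                                             # common time trend
y1_inf = alpha + rng.normal(0, 1, n)                    # Y_{i,1}(inf)
y2_inf = alpha + trend + rng.normal(0, 1, n)            # Y_{i,2}(inf)

# ATT of 3 at t=2; No-Anticipation: period-1 outcome unaffected by treatment
y1 = y1_inf
y2 = np.where(G2, y2_inf + 3.0, y2_inf)

did = (y2[G2].mean() - y1[G2].mean()) - (y2[~G2].mean() - y1[~G2].mean())
naive = y2[G2].mean() - y2[~G2].mean()
print(f"DiD: {did:.3f} (truth: 3.0); naive post comparison: {naive:.3f} (~ -1)")
```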
Baker et al. (2025) emphasize: the weights define the ATT
Unweighted DiD:
\[ATT = \mathbb{E}[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) | G_i = 2]\]
Population-weighted DiD:
\[ATT_\omega = \mathbb{E}_\omega[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) | G_i = 2]\]
Key message: Both are valid ATTs, but they answer different questions
We know what to estimate.
Now: how to estimate and do inference.
\[ \widehat{\theta}^{DiD} = \big(\bar{Y}_{g=2,t=2} - \bar{Y}_{g=2,t=1}\big) - \big(\bar{Y}_{g=\infty,t=2} - \bar{Y}_{g=\infty,t=1}\big) \]
With panel data (same units in both periods), define \(\Delta Y_i = Y_{i,t=2} - Y_{i,t=1}\)
The DiD estimator as a two-sample difference in means: \[ \widehat{\theta}^{DiD} = \overline{\Delta Y}_2 - \overline{\Delta Y}_\infty = \frac{1}{n_2} \sum_{i:\, G_i = 2} \Delta Y_i - \frac{1}{n_\infty} \sum_{i:\, G_i = \infty} \Delta Y_i \]
Two steps (sketched in code below):
Requires panel data:
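A minimal sketch of these two steps with panel data, assuming a long-format pandas DataFrame with hypothetical columns unit, post (0/1 period flag), g2 (1 if \(G_i = 2\)), and y:

```python
import pandas as pd

def did_panel(df: pd.DataFrame) -> float:
    """2x2 DiD with panel data: first-difference, then difference in means.

    Assumes a balanced panel in long format with hypothetical columns
    'unit', 'post', 'g2', 'y' (names are illustrative only).
    """
    wide = df.pivot(index="unit", columns="post", values="y")
    delta = wide[1] - wide[0]                    # Step 1: Delta Y_i per unit
    g2 = df.groupby("unit")["g2"].first()        # group is time-invariant
    return delta[g2 == 1].mean() - delta[g2 == 0].mean()  # Step 2
```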
We use \(G_i\) notation consistently for conditioning and parameters: \(ATT = \mathbb{E}[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \;|\; G_i = 2]\)
For regression and influence function formulas, define a binary shorthand: \[ D_i = \mathbf{1}\{G_i = 2\} = \begin{cases} 1 & \text{if unit } i \text{ is in the treated group} \\ 0 & \text{if unit } i \text{ is in the comparison group} \end{cases} \]
Why this convention?
Rule: All conditioning, parameters, and potential outcomes use \(G_i\); \(D_i\) appears only in regression specifications and influence function formulas
A common way to estimate DiD: Two-Way Fixed Effects (TWFE) regression
Pooled OLS form:
\[Y_{i,t} = \alpha_0 + \gamma_0 D_i + \lambda_0 T_t + \beta^{TWFE} (D_i \times T_t) + \varepsilon_{i,t}\]
where \(T_t = \mathbf{1}\{t = 2\}\), \(D_i = \mathbf{1}\{G_i = 2\}\)
Unit & time FE form:
\[Y_{i,t} = \alpha_i + \lambda_t + \beta^{TWFE} D_{i,t} + \varepsilon_{i,t}\]
where \(D_{i,t} = \mathbf{1}\{G_i \leq t\}\)
\[ \begin{aligned} \mathbb{E}_n[\varepsilon_{i,t}] &= 0 & &\Rightarrow \hat{\alpha}_0 = \bar{Y}_{g=\infty,t=1} \\ \mathbb{E}_n[D_i \cdot \varepsilon_{i,t}] &= 0 & &\Rightarrow \hat{\gamma}_0 = \bar{Y}_{g=2,t=1} - \bar{Y}_{g=\infty,t=1} \\ \mathbb{E}_n[T_t \cdot \varepsilon_{i,t}] &= 0 & &\Rightarrow \hat{\lambda}_0 = \bar{Y}_{g=\infty,t=2} - \bar{Y}_{g=\infty,t=1} \\ \mathbb{E}_n[(D_i \cdot T_t) \cdot \varepsilon_{i,t}] &= 0 & &\Rightarrow \hat{\beta}^{TWFE} = \widehat{\theta}^{DiD} \end{aligned} \]
Baker et al. (2025): Three OLS specifications, same \(\hat{\beta}\)
| Spec | Regression | Data Used |
|---|---|---|
| (1) Pooled OLS | \(Y_{i,t} = \alpha + \gamma D_i + \lambda T_t + \beta (D_i \cdot T_t) + \varepsilon_{i,t}\) | Panel (\(2n\) obs) |
| (2) First-diff | \(\Delta Y_i = \delta + \beta D_i + u_i\) | Panel, FD (\(n\) obs) |
| (3) Unit & time FE | \(Y_{i,t} = \alpha_i + \lambda_t + \beta D_{i,t} + \varepsilon_{i,t}\) | Panel (\(2n\) obs) |
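The numerical equivalence is easy to verify. A sketch on simulated data (all names mine): specs (1) and (2) via least squares, plus the by-hand difference in means, which coincides with spec (3) in the \(2 \times 2\) case:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
g2 = (rng.random(n) < 0.5).astype(float)         # D_i
y1 = 5 + 2 * g2 + rng.normal(0, 1, n)            # period-1 outcomes
y2 = 6 + 2 * g2 + 3 * g2 + rng.normal(0, 1, n)   # common trend +1, effect +3

# (1) Pooled OLS: Y on 1, D, T, D*T -> last coefficient is beta
y = np.concatenate([y1, y2])
D = np.concatenate([g2, g2])
T = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), D, T, D * T])
beta_pooled = np.linalg.lstsq(X, y, rcond=None)[0][3]

# (2) First-difference: Delta Y on 1, D
dy = y2 - y1
Xfd = np.column_stack([np.ones(n), g2])
beta_fd = np.linalg.lstsq(Xfd, dy, rcond=None)[0][1]

# (3) DiD by hand (equals the unit & time FE estimate in the 2x2 case)
beta_hand = dy[g2 == 1].mean() - dy[g2 == 0].mean()

print(beta_pooled, beta_fd, beta_hand)  # identical up to float error
```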
Important conceptual point: The regression is a computational device
In the \(2 \times 2\) case, OLS does not add any statistical content
Why use regression then?
Warning for later: In more complex settings (staggered adoption, heterogeneous effects), TWFE \(\neq\) “the” DiD estimator
Data: County-level mortality rates (ages 20–64), deaths per 100,000
Repository: pedrohcgs/JEL-DiD
Treatment: State-level Medicaid expansion under the ACA
\(2 \times 2\) Setup:
Sample: \(\approx\) 2,300 counties, roughly 900 treated and 1,400 comparison
Key feature: Counties vary enormously in population size — weighting matters!
Baker et al. (2025), Table 2: Simple means and DiD
| | Unweighted | | Pop-Weighted | |
|---|---|---|---|---|
| | Pre (2013) | Post (2014) | Pre (2013) | Post (2014) |
| Treated (\(G_i=2\)) | 419.2 | 428.5 | 322.7 | 326.5 |
| Comparison (\(G_i=\infty\)) | 474.0 | 483.1 | 376.4 | 382.7 |
| \(\Delta\) Treated | +9.3 | | +3.7 | |
| \(\Delta\) Comparison | +9.1 | | +6.3 | |
| DiD | +0.1 | | \(-2.6\) | |
| | (1) Pooled OLS | (2) First-diff | (3) Unit & time FE |
|---|---|---|---|
| \(\hat{\beta}\) (unweighted) | 0.1 | 0.1 | 0.1 |
| SE (county-clustered) | (3.7) | (3.7) | (3.7) |
| \(\hat{\beta}\) (pop-weighted) | \(-2.6\) | \(-2.6\) | \(-2.6\) |
| SE (county-clustered) | (1.5) | (1.5) | (1.5) |
| Observations | \(2n\) | \(n\) | \(2n\) |
DiD by hand:
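A sketch of the computation in Python, assuming the repository's short_data file (described below) has been loaded with hypothetical column names; check the repository's codebook for the actual names and file format:

```python
import pandas as pd

# Hypothetical column names -- verify against the repo before running.
# 'mortality': deaths per 100,000 (ages 20-64); 'treated': 1 if the county's
# state expanded Medicaid in 2014; 'year' in {2013, 2014}; 'pop': population.
df = pd.read_csv("short_data.csv")

cell = df.groupby(["treated", "year"])["mortality"].mean().unstack()
did_unweighted = (cell.loc[1, 2014] - cell.loc[1, 2013]) \
               - (cell.loc[0, 2014] - cell.loc[0, 2013])

# Population-weighted version: weight each county by its population
wmean = lambda g: (g["mortality"] * g["pop"]).sum() / g["pop"].sum()
wcell = df.groupby(["treated", "year"]).apply(wmean).unstack()
did_weighted = (wcell.loc[1, 2014] - wcell.loc[1, 2013]) \
             - (wcell.loc[0, 2014] - wcell.loc[0, 2013])

print(f"Unweighted DiD:   {did_unweighted:+.1f}")  # about +0.1 in Table 2
print(f"Pop-weighted DiD: {did_weighted:+.1f}")    # about -2.6 in Table 2
```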
short_data: balanced panel with 2013–2014 only (the \(2 \times 2\) subset), available at github.com/pedrohcgs/JEL-DiD
Lesson from the Medicaid application:
We have seen DiD work in practice.
Now: what are its statistical properties?
We now formalize the statistical properties of \(\widehat{\theta}^{DiD}\) under an explicit sampling scheme.
Panel Data Sampling. We observe an iid random sample \(\{Z_i\}_{i=1}^n\) where \(Z_i = (Y_{i,t=1}, Y_{i,t=2}, G_i)\) is drawn from the joint distribution of \((Y_{i,t=1}, Y_{i,t=2}, G_i)\).
Write the DiD estimator using empirical expectations: \[ \widehat{\theta}^{DiD} = \frac{\mathbb{E}_n[\Delta Y_i \cdot D_i]}{\mathbb{E}_n[D_i]} - \frac{\mathbb{E}_n[\Delta Y_i \cdot (1 - D_i)]}{\mathbb{E}_n[1 - D_i]} \]
This is a smooth function of sample means: \(\widehat{\theta}^{DiD} = g(\mathbb{E}_n[\mathbf{m}(Z_i)])\)
where \(\mathbf{m}(Z_i) = (\Delta Y_i \cdot D_i, \; D_i, \; \Delta Y_i \cdot (1-D_i), \; 1 - D_i)'\)
By the Law of Large Numbers: \(\mathbb{E}_n[\mathbf{m}(Z_i)] \xrightarrow{p} \mathbb{E}[\mathbf{m}(Z_i)]\)
By the Continuous Mapping Theorem: \(\widehat{\theta}^{DiD} = g(\mathbb{E}_n[\mathbf{m}(Z_i)]) \xrightarrow{p} g(\mathbb{E}[\mathbf{m}(Z_i)]) = \theta^{DiD}\)
\(\Rightarrow\) Consistency follows from LLN + CMT
Apply the delta method to \(g(\mathbb{E}_n[\mathbf{m}(Z_i)])\): \[ \sqrt{n}(\widehat{\theta}^{DiD} - \theta^{DiD}) = \sqrt{n} \cdot \nabla g(\boldsymbol{\mu})' \big(\mathbb{E}_n[\mathbf{m}(Z_i)] - \boldsymbol{\mu}\big) + o_p(1) \] where \(\boldsymbol{\mu} = \mathbb{E}[\mathbf{m}(Z_i)]\)
By the CLT: \[ \sqrt{n}\big(\mathbb{E}_n[\mathbf{m}(Z_i)] - \boldsymbol{\mu}\big) \xrightarrow{d} N(\mathbf{0}, \text{Var}(\mathbf{m}(Z_i))) \]
Combining (Slutsky): \[ \sqrt{n}(\widehat{\theta}^{DiD} - \theta^{DiD}) \xrightarrow{d} N\big(0, \;\nabla g(\boldsymbol{\mu})' \text{Var}(\mathbf{m}(Z_i)) \nabla g(\boldsymbol{\mu})\big) \]
This can be written more compactly using the influence function…
Panel data IF. Under the panel data sampling scheme: \[ \psi_i^{p} = \underbrace{\frac{D_i}{p}}_{w_1(D_i)} \big(\Delta Y_i - \mu_{\Delta,2}\big) \;-\; \underbrace{\frac{1 - D_i}{1 - p}}_{w_0(D_i)} \big(\Delta Y_i - \mu_{\Delta,\infty}\big) \] where \(p = \Pr(G_i = 2)\), \(\mu_{\Delta,g} = \mathbb{E}[\Delta Y_i | G_i = g]\) for \(g \in \{2,\infty\}\).
Suppose: \(p = 0.4\), \(\mu_{\Delta,2} = -4.0\), \(\mu_{\Delta,\infty} = -1.4\)
Expansion county with \(\Delta Y_i = -6.5\): \[ \psi_i^p = \tfrac{1}{0.4}(-6.5 - (-4.0)) - 0 = \textcolor{#b91c1c}{-6.25} \] This county “pulls” the estimate toward a larger mortality reduction
Non-expansion county with \(\Delta Y_i = -0.5\): \[ \psi_i^p = 0 - \tfrac{1}{0.6}(-0.5 - (-1.4)) = \textcolor{#15803d}{-1.5} \] Its mortality fell less than average \(\Rightarrow\) supports the treatment effect
Takeaway: The IF assigns each unit a signed “credit” for the overall estimate
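A quick check of the arithmetic on this slide (all numbers hypothetical, as above):

```python
# Panel DiD influence function:
# psi_i = D_i/p * (dY - mu2) - (1 - D_i)/(1 - p) * (dY - mu_inf)
p, mu2, mu_inf = 0.4, -4.0, -1.4

def psi(d, dy):
    return d / p * (dy - mu2) - (1 - d) / (1 - p) * (dy - mu_inf)

print(psi(1, -6.5))   # expansion county:     -6.25
print(psi(0, -0.5))   # non-expansion county: -1.50
```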
Two terms: treated group’s contribution and comparison group’s contribution
Each term is a demeaned quantity, weighted by group proportion
\(\mathbb{E}[\psi_i^p] = 0\) by construction — the IF is centered
Intuition for the weights:
Why this matters: The IF gives us everything for inference — consistency, asymptotic normality, variance estimation, and bootstrap validity all follow from this representation
Theorem. Under the panel data sampling scheme, SUTVA, No-Anticipation, \(p \in (0,1)\), and \(\text{Var}(\Delta Y_i | G_i = g) < \infty\) for \(g \in \{2, \infty\}\): \[ \sqrt{n}\big(\widehat{\theta}^{DiD} - \theta^{DiD}\big) \xrightarrow{d} N(0, V_p) \] where \(V_p = \text{Var}(\psi_i^p) = \frac{\sigma^2_{\Delta,2}}{p} + \frac{\sigma^2_{\Delta,\infty}}{1-p}\)
With repeated cross-sections (RCS), different units sampled at \(t=1\) and \(t=2\)
Cannot first-difference \(\Rightarrow\) must estimate four means separately
The IF has four components instead of two: \[ \psi_i^{rc} = w_1(D_i, T_i) \cdot (Y_i - \mu_{2,T_i}) - w_0(D_i, T_i) \cdot (Y_i - \mu_{\infty,T_i}) \] where each weight depends on both group and period membership
Requires additional assumption: Stationarity of group composition
Panel is strictly more efficient than RCS — panel exploits within-unit correlation
Given: \(\sqrt{n}(\widehat{\theta}^{DiD} - \theta^{DiD}) \xrightarrow{d} N(0, V_p)\)
Variance estimation (analogy principle): \[ \widehat{V}_p = \frac{1}{n} \sum_{i=1}^n \big(\widehat{\psi}_i^p\big)^2 \] where \(\widehat{\psi}_i^p\) replaces population quantities with sample analogs
Standard error: \(\widehat{SE} = \sqrt{\widehat{V}_p / n}\)
Confidence interval: \(\widehat{\theta}^{DiD} \pm z_{\alpha/2} \cdot \widehat{SE}\)
\(t\)-statistic: \(t = \widehat{\theta}^{DiD} / \widehat{SE}\). Reject \(H_0: \theta^{DiD} = 0\) if \(|t| > z_{\alpha/2}\)
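Collecting these formulas, a sketch of IF-based inference before any clustering, assuming numpy arrays dy (first differences) and d (group indicators):

```python
import numpy as np
from scipy.stats import norm

def did_inference(dy: np.ndarray, d: np.ndarray, alpha: float = 0.05):
    """Point estimate, IF-based SE, and CI for the 2x2 panel DiD."""
    p = d.mean()
    mu2, mu_inf = dy[d == 1].mean(), dy[d == 0].mean()
    theta = mu2 - mu_inf
    # Estimated influence function, plugging in sample analogs
    psi = d / p * (dy - mu2) - (1 - d) / (1 - p) * (dy - mu_inf)
    se = np.sqrt((psi ** 2).mean() / len(dy))
    z = norm.ppf(1 - alpha / 2)
    return theta, se, (theta - z * se, theta + z * se)
```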
But wait: Should we cluster the standard errors?
Key advantage of IF-based inference: The multiplier bootstrap
Idea: Perturb the IF with random weights instead of resampling data \[ \widehat{\theta}^{*,b} = \widehat{\theta}^{DiD} + \frac{1}{n} \sum_{i=1}^n U_i^{(b)} \cdot \widehat{\psi}_i^p \] where \(U_i^{(b)} \sim N(0,1)\) are iid random weights
No re-estimation needed: Each bootstrap draw is a simple weighted sum
For cluster-level inference: Draw \(U_s^{(b)}\) at the cluster level
Multiplier Bootstrap for DiD (Cluster-Robust)
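A sketch of the algorithm, assuming the point estimate, estimated IF values, and cluster labels (e.g., states) are already computed:

```python
import numpy as np

def multiplier_bootstrap(theta_hat, psi_hat, cluster_ids, B=999, seed=0):
    """Cluster-robust multiplier bootstrap for an IF-based estimator.

    theta_hat:   scalar point estimate
    psi_hat:     array of estimated influence-function values, one per unit
    cluster_ids: array assigning each unit to a cluster (e.g., state)
    Returns the B bootstrap draws of theta.
    """
    rng = np.random.default_rng(seed)
    n = len(psi_hat)
    clusters, inv = np.unique(cluster_ids, return_inverse=True)
    draws = np.empty(B)
    for b in range(B):
        u = rng.standard_normal(len(clusters))  # one weight per CLUSTER
        draws[b] = theta_hat + (u[inv] * psi_hat).sum() / n
    return draws

# Usage: boot = multiplier_bootstrap(theta, psi, states)
#        se_boot = boot.std(ddof=1); ci = np.percentile(boot, [2.5, 97.5])
```

No model is re-estimated inside the loop: each draw is a weighted sum of the fixed \(\widehat{\psi}_i^p\), which is what makes this bootstrap cheap.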
Standard cluster-robust inference relies on \(S \to \infty\) (number of clusters)
Problem: In many DiD applications, \(S\) is small (e.g., 50 states, 10 provinces)
Approaches in the literature:
No silver bullet: Each approach requires additional assumptions; this is an active area of research (see Alvarez, Ferman, and Wüthrich 2025 for a recent survey)
We have assumed panel data. But what if we do not have it?
What changes with repeated cross-sections?
Repeated Cross-Section Sampling. Period \(t\) data \(\{(Y_{i,t}, G_i)\}\) is an iid sample from \(F_{Y,G|T=t}\). Observations across periods are independent.
Different units sampled at \(t=1\) and \(t=2\)
Requires additional assumption: Stationarity of group composition
IF has four components: one per group-period cell
Examples: CPS microdata, Census data, polling data before/after events
Panel Data
Repeated Cross-Sections
Key takeaways from the \(2 \times 2\) DiD framework:
Identification: SUTVA + No-Anticipation + Parallel Trends \(\Rightarrow\) DiD identifies the ATT using four observable group-time means
Estimation: DiD-by-hand and TWFE regression are numerically identical in the \(2 \times 2\) case — regression is just a convenient computational device
Inference: Influence functions provide consistency, asymptotic normality, and variance estimation. Always cluster at least at the unit level; ideally at the treatment-assignment level
Weights matter: The choice of weights (unweighted vs. population-weighted) defines a different target parameter
Appendix
Detailed derivations and proofs
Setup: \(Z_i = (Y_{i,t=1}, Y_{i,t=2}, G_i)\) iid, \(D_i = \mathbf{1}\{G_i = 2\}\), \(\Delta Y_i = Y_{i,t=2} - Y_{i,t=1}\)
Define population quantities: \[ \begin{aligned} \mu_1 &= \mathbb{E}[\Delta Y_i \cdot D_i], \quad p = \mathbb{E}[D_i] \\ \mu_0 &= \mathbb{E}[\Delta Y_i \cdot (1-D_i)], \quad 1-p = \mathbb{E}[1-D_i] \end{aligned} \]
The DiD estimand: \(\theta^{DiD} = \frac{\mu_1}{p} - \frac{\mu_0}{1-p}\)
Sample analogs: \[ \widehat{\theta}^{DiD} = \frac{\mathbb{E}_n[\Delta Y_i \cdot D_i]}{\mathbb{E}_n[D_i]} - \frac{\mathbb{E}_n[\Delta Y_i \cdot (1-D_i)]}{\mathbb{E}_n[1-D_i]} \]
By LLN: \(\mathbb{E}_n[\Delta Y_i \cdot D_i] \xrightarrow{p} \mu_1\), etc.
By CMT (continuous mapping theorem): \(\widehat{\theta}^{DiD} \xrightarrow{p} \theta^{DiD}\) \(\square\)
Goal: Show \(\sqrt{n}(\widehat{\theta}^{DiD} - \theta^{DiD}) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_i^p + o_p(1)\)
Step 1: Write the estimator as a function of sample means. Let \(\hat{\mu}_1 = \mathbb{E}_n[\Delta Y_i D_i]\), \(\hat{p} = \mathbb{E}_n[D_i]\), \(\hat{\mu}_0 = \mathbb{E}_n[\Delta Y_i(1-D_i)]\), \(\hat{q} = \mathbb{E}_n[1-D_i]\) \[ \widehat{\theta}^{DiD} = \frac{\hat{\mu}_1}{\hat{p}} - \frac{\hat{\mu}_0}{\hat{q}} \]
Step 2: Linearize each ratio. For the first term: \[ \begin{aligned} \frac{\hat{\mu}_1}{\hat{p}} - \frac{\mu_1}{p} &= \frac{\hat{\mu}_1 p - \mu_1 \hat{p}}{\hat{p} \cdot p} \\ &= \frac{(\hat{\mu}_1 - \mu_1) p - \mu_1 (\hat{p} - p)}{\hat{p} \cdot p} \\ &= \frac{1}{p}\big(\hat{\mu}_1 - \mu_1\big) - \frac{\mu_1}{p^2}(\hat{p} - p) + o_p(n^{-1/2}) \end{aligned} \] where the last step uses \(\hat{p} \xrightarrow{p} p > 0\).
Step 3: Similarly for the second ratio: \[ \frac{\hat{\mu}_0}{\hat{q}} - \frac{\mu_0}{1-p} = \frac{1}{1-p}(\hat{\mu}_0 - \mu_0) + \frac{\mu_0}{(1-p)^2}(\hat{p} - p) + o_p(n^{-1/2}) \]
Step 4: Combine: \[ \begin{aligned} \widehat{\theta}^{DiD} - \theta^{DiD} &= \frac{1}{p}(\hat{\mu}_1 - \mu_1) - \frac{\mu_{\Delta,2}}{p}(\hat{p} - p) \\ &\quad - \frac{1}{1-p}(\hat{\mu}_0 - \mu_0) - \frac{\mu_{\Delta,\infty}}{1-p}(\hat{p} - p) + o_p(n^{-1/2}) \end{aligned} \] where \(\mu_{\Delta,2} = \mu_1/p = \mathbb{E}[\Delta Y_i | G_i = 2]\) and \(\mu_{\Delta,\infty} = \mu_0/(1-p) = \mathbb{E}[\Delta Y_i | G_i = \infty]\).
Note: \(\hat{p} - p = \mathbb{E}_n[D_i] - p\) and \(\hat{\mu}_1 - \mu_1 = \mathbb{E}_n[\Delta Y_i D_i] - \mu_1\)
Step 5: Express as average of iid terms. Each sample mean is an average: \[ \begin{aligned} \widehat{\theta}^{DiD} - \theta^{DiD} &= \frac{1}{n}\sum_{i=1}^n \bigg[ \frac{D_i \Delta Y_i - \mu_1}{p} - \frac{\mu_{\Delta,2}(D_i - p)}{p} \\ &\qquad\qquad - \frac{(1-D_i)\Delta Y_i - \mu_0}{1-p} - \frac{\mu_{\Delta,\infty}(D_i - p)}{1-p} \bigg] + o_p(n^{-1/2}) \end{aligned} \]
Step 6: Simplify the treated group term: \[ \frac{D_i \Delta Y_i - \mu_1}{p} - \frac{\mu_{\Delta,2}(D_i - p)}{p} = \frac{D_i(\Delta Y_i - \mu_{\Delta,2})}{p} \]
Similarly for the comparison term. Thus: \[ \widehat{\theta}^{DiD} - \theta^{DiD} = \frac{1}{n}\sum_{i=1}^n \underbrace{\left[\frac{D_i}{p}(\Delta Y_i - \mu_{\Delta,2}) - \frac{1-D_i}{1-p}(\Delta Y_i - \mu_{\Delta,\infty})\right]}_{\psi_i^p} + o_p(n^{-1/2}) \]
From the IF representation: \(\sqrt{n}(\widehat{\theta}^{DiD} - \theta^{DiD}) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_i^p + o_p(1)\)
By CLT: \(\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_i^p \xrightarrow{d} N(0, \text{Var}(\psi_i^p))\)
Computing \(\text{Var}(\psi_i^p)\): \[ \begin{aligned} V_p &= \text{Var}\left(\frac{D_i}{p}(\Delta Y_i - \mu_{\Delta,2}) - \frac{1-D_i}{1-p}(\Delta Y_i - \mu_{\Delta,\infty})\right) \\ &= \frac{1}{p^2}\text{Var}(D_i(\Delta Y_i - \mu_{\Delta,2})) + \frac{1}{(1-p)^2}\text{Var}((1-D_i)(\Delta Y_i - \mu_{\Delta,\infty})) \end{aligned} \] (cross-term is zero since \(D_i(1-D_i) = 0\))
Simplifying: \(\text{Var}(D_i(\Delta Y_i - \mu_{\Delta,2})) = p \cdot \sigma^2_{\Delta,2}\) where \(\sigma^2_{\Delta,2} = \text{Var}(\Delta Y_i | G_i = 2)\)
\[ \boxed{V_p = \frac{\sigma^2_{\Delta,2}}{p} + \frac{\sigma^2_{\Delta,\infty}}{1-p}} \]
Plug-in estimator: Replace population quantities with sample analogs \[ \widehat{\psi}_i^p = \frac{D_i}{\hat{p}}(\Delta Y_i - \overline{\Delta Y}_2) - \frac{1-D_i}{1-\hat{p}}(\Delta Y_i - \overline{\Delta Y}_\infty) \]
Variance estimate: \[ \widehat{V}_p = \frac{1}{n}\sum_{i=1}^n (\widehat{\psi}_i^p)^2 \]
By LLN: \(\widehat{V}_p \xrightarrow{p} V_p\)
Standard error: \(\widehat{SE}(\widehat{\theta}^{DiD}) = \sqrt{\widehat{V}_p / n}\)
Alternative: Direct plug-in \[ \widehat{V}_p^{alt} = \frac{\hat{\sigma}^2_{\Delta,2}}{\hat{p}} + \frac{\hat{\sigma}^2_{\Delta,\infty}}{1 - \hat{p}} \] where \(\hat{\sigma}^2_{\Delta,g} = \frac{1}{n_g}\sum_{i:G_i=g}(\Delta Y_i - \overline{\Delta Y}_g)^2\)
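The IF-based and direct plug-in estimators coincide exactly when \(\hat{\sigma}^2_{\Delta,g}\) uses the divisor \(n_g\), as defined above; a quick numerical check (simulated data, names mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
d = (rng.random(n) < 0.4).astype(float)
dy = rng.normal(-4, 3, n) * d + rng.normal(-1.4, 2, n) * (1 - d)

p = d.mean()
m2, m0 = dy[d == 1].mean(), dy[d == 0].mean()

# IF-based estimator: mean of squared estimated influence function
psi = d / p * (dy - m2) - (1 - d) / (1 - p) * (dy - m0)
V_if = (psi ** 2).mean()

# Direct plug-in (np.var uses divisor n_g by default, matching the slide)
V_plug = dy[d == 1].var() / p + dy[d == 0].var() / (1 - p)

print(V_if, V_plug)   # identical up to float error
```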
RCS data: \(Z_i = (Y_i, G_i, T_i)\), \(D_i = \mathbf{1}\{G_i = 2\}\), \(T_i = \mathbf{1}\{\text{sampled at } t=2\}\) (distinct from \(T_t\) in regression)
Cannot first-difference. The DiD estimand uses four conditional means: \[ \theta^{DiD} = (\mu_{2,2} - \mu_{2,1}) - (\mu_{\infty,2} - \mu_{\infty,1}) \] where \(\mu_{g,t} = \mathbb{E}[Y_i \mid G_i = g, T_i = \mathbf{1}\{t = 2\}]\)
Under stationarity: \(\Pr(G_i = 2 | T_i = t)\) is constant across \(t\)
Linearization: Apply the delta method to each of the four ratios
The IF takes the form: \[ \begin{aligned} \psi_i^{rc} &= \frac{D_i T_i}{p \cdot \lambda}(Y_i - \mu_{2,2}) - \frac{D_i(1-T_i)}{p \cdot (1-\lambda)}(Y_i - \mu_{2,1}) \\ &\quad - \frac{(1-D_i)T_i}{(1-p)\cdot\lambda}(Y_i - \mu_{\infty,2}) + \frac{(1-D_i)(1-T_i)}{(1-p)(1-\lambda)}(Y_i - \mu_{\infty,1}) \end{aligned} \] where \(\lambda = \Pr(T_i = 1)\) and \(\mu_{g,t} = \mathbb{E}[Y_i \mid G_i = g, T_i = \mathbf{1}\{t = 2\}]\)
Asymptotic variance: \(V_{rc} = \text{Var}(\psi_i^{rc})\)
Since \((D_i, T_i) \in \{0,1\}^2\) creates four disjoint groups, cross-products vanish: \[ V_{rc} = \frac{\sigma^2_{2,2}}{p \cdot \lambda} + \frac{\sigma^2_{2,1}}{p \cdot (1-\lambda)} + \frac{\sigma^2_{\infty,2}}{(1-p) \cdot \lambda} + \frac{\sigma^2_{\infty,1}}{(1-p)(1-\lambda)} \] where \(\sigma^2_{g,t} = \text{Var}(Y_i | G_i = g, T_i = t)\)
Comparison with panel: \[ V_p = \frac{\sigma^2_{\Delta,2}}{p} + \frac{\sigma^2_{\Delta,\infty}}{1-p} \;\;<\;\; V_{rc} \]
Two sources of RCS inefficiency:
Conclusion: Panel data is strictly more efficient than RCS for DiD (\(V_p < V_{rc}\) always)
Question: Given a total budget of \(n\) observations, how should we split between \(t=1\) and \(t=2\) in an RCS?
Let \(\lambda = n_2/(n_1 + n_2)\) be the fraction sampled at \(t=2\)
The asymptotic variance is: \[ V_{rc}(\lambda) = \frac{1}{\lambda}\left(\frac{\sigma^2_{2,2}}{p} + \frac{\sigma^2_{\infty,2}}{1-p}\right) + \frac{1}{1-\lambda}\left(\frac{\sigma^2_{2,1}}{p} + \frac{\sigma^2_{\infty,1}}{1-p}\right) \]
Optimal allocation: \[ \lambda^* = \frac{\sqrt{A_2}}{\sqrt{A_1} + \sqrt{A_2}} \] where \(A_t = \frac{\sigma^2_{2,t}}{p} + \frac{\sigma^2_{\infty,t}}{1-p}\)
Practical implication: If the outcome is more variable in the post-period (e.g., due to treatment effect heterogeneity), sample more observations in the post-period
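A sketch of the allocation rule (function and example values are mine):

```python
import numpy as np

def optimal_post_share(s2_treat, s2_comp, p):
    """Optimal fraction of an RCS sample to allocate to the post period.

    s2_treat, s2_comp: (pre, post) outcome variances for each group
    p: treated-group share. Returns lambda* = sqrt(A2)/(sqrt(A1)+sqrt(A2)).
    """
    A = [s2_treat[t] / p + s2_comp[t] / (1 - p) for t in (0, 1)]
    return np.sqrt(A[1]) / (np.sqrt(A[0]) + np.sqrt(A[1]))

# Example: post-period outcomes twice as variable -> oversample the post period
print(optimal_post_share(s2_treat=(1.0, 2.0), s2_comp=(1.0, 2.0), p=0.4))
# ~0.586 = sqrt(2) / (1 + sqrt(2))
```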
ECON 730 | Causal Panel Data | Pedro H. C. Sant’Anna