Lecture 6: Incorporating Covariates into DiD
Emory University
Spring 2026
I
Why Covariates?
Applications, conditional PT, TWFE fragility
II
Three Strategies
RA, IPW, doubly robust
III
Design & Applications
Balance, Medicaid, Brazil CAPS
IV
Repeated Cross-Sections
Compositional changes
V
ML & DiD
LASSO, cross-fitting, causal forests
Building on Lecture 5 (\(2 \times 2\) DiD without covariates)
Lecture 5 established the \(2 \times 2\) DiD framework:
But is unconditional parallel trends realistic?
What if treated and comparison units differ systematically in pre-treatment characteristics?
Application 1: ACA Medicaid Expansion (Baker et al. 2025)
Application 2: Brazil Psychiatric Reform (Dias and Fontes 2024)
Both applications: treated and comparison groups differ in pre-treatment characteristics
Question: Can we still use unconditional PT?
Treated states (expanded Medicaid in 2014) vs. comparison states (never expanded)
Let’s look at pre-treatment characteristics:
Do these groups look comparable?
Sometimes the unconditional PT assumption is too strong
But PT may be plausible within subgroups defined by pre-treatment characteristics \(X_i\)
Intuition: “Among counties with similar demographics, treated and comparison counties would have trended similarly absent treatment”
This is the conditional parallel trends assumption
Key insight: Covariates can make the PT assumption more credible, but we need appropriate estimation methods to exploit them.
Two periods: \(t = 1\) (pre-treatment) and \(t = 2\) (post-treatment)
Two groups: \(G_i \in \{2, \infty\}\), with \(D_i = \mathbf{1}\{G_i = 2\}\)
Potential outcomes: \(Y_{i,t}(g)\) for each treatment timing \(g\)
Target parameter:
\[\text{ATT} = \mathbb{E}[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \mid G_i = 2]\]
\(X_i\): vector of pre-treatment covariates (observed before treatment)
\(p(X_i) \equiv \mathbb{P}(D_i = 1 \mid X_i)\): generalized propensity score
\(p \equiv \mathbb{P}(D_i = 1) = \mathbb{E}[D_i]\): unconditional treatment probability
\(T_i \in \{1, 2\}\): period indicator for unit \(i\) (in panel data, each unit observed in both; in RCS, \(T_i\) is the sampling period)
Later we will also define the outcome-regression function \(m_\Delta^{d=0}(x)\), the weights \(w_1\) and \(w_0\), and, for repeated cross-sections, \(\lambda = \mathbb{P}(T = 2)\)
PT may hold only within covariate subgroups:
Assumption (Conditional PT). \(\mathbb{E}[\Delta Y_i(\infty) \mid X_i, D_i\!=\!1] = \mathbb{E}[\Delta Y_i(\infty) \mid X_i, D_i\!=\!0]\) a.s.
For identification, we also need treated units to have comparable controls at every covariate value:
Assumption (Strong Overlap). For some \(\epsilon > 0\), \(\mathbb{P}(D_i = 1 \mid X_i) < 1 - \epsilon\) almost surely.
Every treated unit must have comparison units with similar covariate values
Without overlap: we cannot learn about the counterfactual for some treated units
For identification: can relax to \(\epsilon = 0\) (boundary case)
For standard inference: need \(\epsilon > 0\) to avoid irregularity (Khan and Tamer 2010)
Closely related to overlap conditions in the matching/weighting literature (Crump et al. 2009)
Note: for ATT, we only need \(p(X_i)\) bounded away from 1, not from 0 — unlike ATE, which requires both bounds.
Solid = observed comparison trend. Dashed = PT counterfactual for the treated group. Within subgroups, these are parallel.
Assumptions:
A1. SUTVA
A2. No Anticipation: \(Y_{i,t=1}(2) = Y_{i,t=1}(\infty)\)
A3. Conditional PT
A4. Overlap
Step 1: Identify the conditional ATT: \[\text{CATT}(x) = \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 1] - \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 0]\] where \(\Delta Y_i \equiv Y_{i,t=2} - Y_{i,t=1}\).
Step 2: Integrate over the treated covariate distribution: \[\text{ATT} = \mathbb{E}[\text{CATT}(X_i) \mid D_i = 1]\]
The most common approach in applied work: “just add covariates to the regression” \[Y_{i,t} = \alpha_i + \lambda_t + \tau D_{it} + X_{i,t}'\beta + \varepsilon_{i,t}\]
Recall from Lecture 5: without covariates, TWFE \(=\) DiD-by-hand in \(2 \times 2\)
Many practitioners expect the same logic extends: “\(\hat{\tau}\) should estimate the ATT after controlling for \(X\)”
Adding \(X\) to TWFE is not the same as allowing for covariate-specific trends. The regression imposes strong — and often hidden — restrictions.
Consider the TWFE specification with pooled data: \[Y_{i,t} = \tilde{\alpha}_0 + \tilde{\gamma}_0 D_i + \tilde{\lambda}_0 \mathbf{1}\{T_i\!=\!2\} + \textcolor{red}{\tilde{\beta}_0^{twfe}} \big(D_i \cdot \mathbf{1}\{T_i\!=\!2\}\big) + X_i' \tilde{\alpha}_1 + \tilde{\varepsilon}_{i,t}\]
The implied conditional means:
\[\begin{aligned} \mathbb{E}[Y \mid D\!=\!0, T\!=\!1, X] &= \tilde{\alpha}_0 + X'\tilde{\alpha}_1, \quad \mathbb{E}[Y \mid D\!=\!0, T\!=\!2, X] = \tilde{\alpha}_0 + \tilde{\lambda}_0 + X'\tilde{\alpha}_1 \\ \mathbb{E}[Y \mid D\!=\!1, T\!=\!1, X] &= \tilde{\alpha}_0 + \tilde{\gamma}_0 + X'\tilde{\alpha}_1, \quad \mathbb{E}[Y \mid D\!=\!1, T\!=\!2, X] = \tilde{\alpha}_0 + \tilde{\gamma}_0 + \tilde{\lambda}_0 + \textcolor{red}{\tilde{\beta}_0^{twfe}} + X'\tilde{\alpha}_1 \end{aligned}\]
TWFE forces homogeneous trends — but how bad is the bias in practice?
A controlled simulation where ATT \(= 0\).
Data generating process from Sant’Anna and Zhao (2020):
TWFE regression: \(Y_{i,t} = \alpha + \gamma D_i + \lambda \mathbf{1}\{T_i\!=\!2\} + \tau \big(D_i \cdot \mathbf{1}\{T_i\!=\!2\}\big) + X_i'\beta + \varepsilon_{i,t}\)
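A minimal R sketch of this pooled regression, assuming a stacked data frame `df` with outcome `y`, treatment-group indicator `treat`, post-period indicator `post`, and covariates `x1`, `x2` (all names hypothetical):

```r
# Pooled TWFE-style regression with covariates in levels;
# the interaction coefficient is the estimate examined in the simulation
twfe_fit <- lm(y ~ treat * post + x1 + x2, data = df)
coef(twfe_fit)["treat:post"]
```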
Results (\(n = 1{,}000\), 1,000 MC replications):
The simulation used time-invariant \(X_i\) in a pooled regression. What if covariates vary over time and we use fixed effects?
Caetano & Callaway (2024): a formal decomposition.
Caetano and Callaway (2024) analyze the standard fixed effects specification: \[Y_{i,t} = \theta_t + \eta_i + \alpha D_{it} + X_{i,t}'\beta + e_{i,t}\]
Compare with the pooled specification on the previous slides, which controls for \(X_i\) in levels but forces homogeneous time trends.
When conditional PT holds but TWFE is used with covariates:
\[\tilde{\beta}_0^{twfe} = \underbrace{\mathbb{E}[w(\Delta X) \cdot \text{ATT}(X) \mid D\!=\!1]}_{\text{weighted ATT}} + \underbrace{\text{BIAS}_A}_{\substack{\text{time-}\\\text{invariant}}} + \underbrace{\text{BIAS}_B}_{\substack{\text{levels vs.}\\\text{changes}}} + \underbrace{\text{BIAS}_C}_{\text{nonlinearity}}\]
In Act II, we introduce three estimation strategies (RA, IPW, DR) that avoid all three biases by separating identification from estimation.
C&C showed TWFE with covariates introduces three sources of bias. How do these biases play out in real data?
Back to the Brazil application.
TWFE with covariates is fragile.
We need better tools: separate identification from estimation.
Idea: Model the comparison group’s outcome evolution \(\mathbb{E}[\Delta Y_i \mid X_i, D_i = 0]\), then impute for treated units
With panel data, the ATT simplifies to:
\[\text{ATT} = \mathbb{E}[\Delta Y_i \mid D_i = 1] - \mathbb{E}\left[m_\Delta^{d=0}(X_i) \mid D_i = 1\right] = \mathbb{E}\left[m_\Delta^{d=1}(X_i) - m_\Delta^{d=0}(X_i) \mid D_i = 1\right]\]
where \(m_\Delta^{d=0}(x) \equiv \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 0]\)
Key difference from TWFE: regression is estimated on the comparison group only, then predictions are made for treated units. This allows covariate-specific trends.
| Unit \(i\) | \(D_i\) | \(X_i\) (poverty %) | \(\Delta Y_i\) (mortality change) |
|---|---|---|---|
| 1 | 0 (comparison) | 12 | \(-2.0\) |
| 2 | 0 (comparison) | 18 | \(-0.5\) |
| 3 | 0 (comparison) | 22 | \(+1.0\) |
| 4 | 1 (treated) | 20 | \(-3.0\) |
| 5 | 1 (treated) | 15 | \(-2.5\) |
The key: comparison group regression tells us “what would have happened to treated units if they hadn’t been treated.”
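A minimal R sketch of the RA steps on the toy data above (regression on the comparison group only, then imputation for treated units):

```r
# Toy data from the table above
toy <- data.frame(
  d  = c(0, 0, 0, 1, 1),                # treatment group
  x  = c(12, 18, 22, 20, 15),           # poverty rate
  dy = c(-2.0, -0.5, 1.0, -3.0, -2.5)   # change in mortality
)

# Step 1: fit m_Delta^{d=0}(x) on comparison units only
m0 <- lm(dy ~ x, data = subset(toy, d == 0))

# Step 2: impute counterfactual changes for the treated units
treated <- subset(toy, d == 1)
cf <- predict(m0, newdata = treated)

# Step 3: ATT_hat^{ra} = average of observed minus imputed changes (about -2.3 here)
mean(treated$dy - cf)
```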
Consistent when the outcome model \(m_\Delta^{d=0}(x)\) is correctly specified
Inconsistent when \(m_\Delta^{d=0}(x)\) is misspecified
Works well when:
RA relies entirely on the researcher’s ability to model the comparison group’s outcome evolution.
Q: If RA is inconsistent under misspecification, why not always use a very flexible model for \(m_\Delta^{d=0}(x)\)?
Medicaid Expansion
Brazil CAPS Reform
The RA recipe (same in both applications):
\[\widehat{\text{ATT}}^{ra} = \frac{1}{n_1}\sum_{i:\,D_i=1}\bigg(\underbrace{\Delta Y_i}_{\text{observed}} - \underbrace{\hat{m}_\Delta^{d=0}(X_i)}_{\text{counterfactual}}\bigg)\]
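This recipe is packaged in `DRDID::ordid()` (outcome-regression DiD); a sketch assuming a balanced panel `panel_data` with hypothetical columns `y`, `post`, `id`, `treat`, and `x1`–`x3`:

```r
library(DRDID)

# Outcome-regression (RA) DiD: the outcome model is fit on the comparison group only
ra_out <- ordid(yname = "y", tname = "post", idname = "id", dname = "treat",
                xformla = ~ x1 + x2 + x3,
                data = panel_data, panel = TRUE)
summary(ra_out)
```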
RA models outcomes directly.
What if we instead reweight observations?
\[\text{ATT}^{ipw} = \frac{\mathbb{E}\left[\left(D_i - \frac{(1-D_i) p(X_i)}{1 - p(X_i)}\right) \Delta Y_i\right]}{\mathbb{E}[D_i]}\]
\[\text{ATT}^{ipw}_{std} = \mathbb{E}\bigg[\bigg(\underbrace{\frac{D_i}{\mathbb{E}[D_i]}}_{w_1(D_i)} - \underbrace{\frac{\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}\right]}}_{w_0(D_i, X_i)}\bigg) \Delta Y_i\bigg]\]
IPW uses \(w = p(X)/(1-p(X))\) to reweight comparison units so their covariate distribution matches the treated group.
Working model: Logistic \(p(X_i; \gamma_0) = \Lambda(X_i'\gamma_0)\)
Step 1: Estimate \(\gamma_0\) by logit MLE
Step 2: Plug in \(\hat{p}(X_i)\) and compute weighted averages
Influence function accounts for estimation error in \(\hat{\gamma}_n\)
Consistent when propensity score is correctly specified
Inconsistent when propensity score is misspecified — even if outcome model is known!
Overlap is critical: if \(p(X_i) \approx 1\), weights explode
Medicaid Expansion
Brazil CAPS Reform
The IPW recipe: (1) Estimate \(\hat{p}(X_i)\) using both treated and comparison units, (2) Reweight comparison units by \(\hat{p}(X_i)/(1 - \hat{p}(X_i))\), (3) Compute ATT as weighted difference:
\[\widehat{\text{ATT}}^{ipw} = \bar{\Delta Y}_{D=1} - \frac{\sum_{i:D_i=0} \hat{w}_i \cdot \Delta Y_i}{\sum_{i:D_i=0} \hat{w}_i}\]
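A minimal hand-rolled version of this recipe, assuming a one-row-per-unit data frame `df` with change `dy`, treatment `treat`, and covariates `x1`, `x2` (names hypothetical):

```r
# Step 1: propensity score from a logit on treated + comparison units
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))

# Step 2: odds weights p(X) / (1 - p(X)) for comparison units
w0 <- phat / (1 - phat)

# Step 3: Hajek (normalized) IPW ATT
att_ipw <- mean(df$dy[df$treat == 1]) -
  weighted.mean(df$dy[df$treat == 0], w0[df$treat == 0])
att_ipw
```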
| | RA | IPW |
|---|---|---|
| Models | Outcome evolution | Treatment assignment |
| Consistent when | \(m_\Delta^{d=0}(x)\) correct | \(p(x)\) correct |
| Fails when | Outcome misspecified | PS misspecified |
| Sensitive to | Functional form | Overlap violations |
RA and IPW have complementary failure modes. Can we combine them for robustness against either misspecification?
RA and IPW each rely on one model being correct.
Can we combine them for robustness against *either* type of misspecification?
Key idea: Combine outcome modeling (RA) with reweighting (IPW). Consistent if either the outcome model or the propensity score is correctly specified (but not necessarily both).
DR DiD Estimand (Sant’Anna and Zhao 2020): \(\displaystyle \text{ATT}^{dr} = \mathbb{E}\!\left[\Big(w_1(D_i) - w_0(D_i, X_i)\Big)\Big(\Delta Y_i - m_\Delta^{d=0}(X_i)\Big)\right]\)
Treated weight: \(w_1(D_i) = \frac{D_i}{\mathbb{E}[D_i]}\)
Comparison weight: \(w_0(D_i, X_i) = \frac{\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}\right]}\)
The DR estimand has two equivalent decompositions:
\[\begin{aligned} \text{ATT}^{dr} &= \underbrace{\text{ATT}^{ipw}_{std}}_{\text{IPW}} - \underbrace{\mathbb{E}\!\left[\left(w_1(D_i) - w_0(D_i, X_i)\right) m_\Delta^{d=0}(X_i)\right]}_{\text{Outcome-based bias correction}} \\[0.2em] &= \underbrace{\text{ATT}^{ra}}_{\text{RA}} - \underbrace{\mathbb{E}\!\left[w_0(D_i, X_i)\left(\Delta Y_i - m_\Delta^{d=0}(X_i)\right)\right]}_{\text{Reweighting-based bias correction}} \end{aligned}\]
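A minimal sample-analogue sketch of the DR estimand, reusing the hypothetical data frame `df` (columns `dy`, `treat`, `x1`, `x2`) from the IPW slide:

```r
# Nuisance 1: propensity score (logit, all units)
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))

# Nuisance 2: comparison-group outcome evolution, predicted for all units
m0_fit <- lm(dy ~ x1 + x2, data = subset(df, treat == 0))
m0hat  <- predict(m0_fit, newdata = df)

# Weights w1 and w0 from the estimand, normalized in-sample
D  <- df$treat
w1 <- D / mean(D)
w0 <- (phat * (1 - D) / (1 - phat)) / mean(phat * (1 - D) / (1 - phat))

# Doubly robust ATT
att_dr <- mean((w1 - w0) * (df$dy - m0hat))
att_dr
```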
Medicaid Expansion
Brazil CAPS Reform
The DR recipe (same in both applications):
DR is consistent if either model is correct.
But what happens to precision when both are right?
Q: If DR gives us two chances, why care about getting both models right?
Sant’Anna and Zhao (2020) derive the semiparametric efficiency bound for ATT under conditional PT. The bound equals the variance of the efficient influence function:
\[\psi_i^{eff} = \big(w_1(D_i) - w_0(D_i, X_i; p_0)\big)\big(\Delta Y_i - m_{\Delta}^{d=0}(X_i)\big) - w_1(D_i)\cdot\text{ATT}\]
Sant’Anna and Zhao (2020) propose two versions:
Traditional DR (drdid_panel): standard logit PS + OLS outcome model
Improved DR (drdid_imp_panel): inverse probability tilting PS (Graham, Pinto, and Egel 2012) + weighted OLS outcome model
The improved version ensures the estimated PS satisfies an exact balancing condition, improving finite-sample performance
Both are doubly robust and locally efficient under correct specification
Bonus: The improved DR estimator is also doubly robust for inference — no need to adjust standard errors for first-step estimation of \(p(X_i)\) or \(m_\Delta^{d=0}(X_i)\)
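Both versions can be run with `DRDID::drdid()`; a sketch assuming its `estMethod` argument takes `"imp"` (improved, the default) and `"trad"` (traditional), with the same hypothetical panel columns as above:

```r
library(DRDID)

# Improved DR DiD: inverse probability tilting PS + weighted OLS outcome model
dr_imp <- drdid(yname = "y", tname = "post", idname = "id", dname = "treat",
                xformla = ~ x1 + x2 + x3, data = panel_data, panel = TRUE,
                estMethod = "imp")

# Traditional DR DiD: logit PS (MLE) + OLS outcome model
dr_trad <- drdid(yname = "y", tname = "post", idname = "id", dname = "treat",
                 xformla = ~ x1 + x2 + x3, data = panel_data, panel = TRUE,
                 estMethod = "trad")
```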
RA, IPW, and DR have different robustness and efficiency properties.
Do these properties hold in finite samples?
Four DGPs with true ATT \(= 0\), varying whether the propensity score and outcome regression models are correctly specified:
7 estimators: Oracle (infeasible), DR-Improved, DR-Traditional, IPW, IPW-Normalized, RA, TWFE
\(n = 500\), \(1{,}000\) MC replications
| | DGP 1 Bias | DGP 1 RMSE | DGP 2 Bias | DGP 2 RMSE | DGP 3 Bias | DGP 3 RMSE | DGP 4 Bias | DGP 4 RMSE |
|---|---|---|---|---|---|---|---|---|
| | Both correct | | PS wrong | | OR wrong | | Both wrong | |
| TWFE | \(-20.9\) | \(21.1\) | \(-20.5\) | \(20.6\) | \(-28.2\) | \(28.3\) | \(-16.4\) | \(16.5\) |
| Regression | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(-6.1\) | \(6.2\) | \(-5.2\) | \(5.3\) |
| IPW (Hajek) | \(0.0\) | \(1.2\) | \(-1.9\) | \(2.2\) | \(0.0\) | \(1.3\) | \(-4.0\) | \(4.2\) |
| DR (Trad.) | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(0.0\) | \(1.0\) | \(-3.2\) | \(3.5\) |
| DR (Impr.) | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(0.0\) | \(1.0\) | \(-1.0\) | \(2.6\) |
| | DGP 1 | DGP 2 | DGP 3 | DGP 4 |
|---|---|---|---|---|
| | Both correct | PS wrong | OR wrong | Both wrong |
| TWFE | 0.0% | 0.0% | 0.0% | 0.0% |
| Regression | 93.9% | 94.9% | 83.8% | 1.1% |
| IPW (Hajek) | 94.0% | 83.7% | 95.2% | 22.0% |
| DR (Trad.) | 95.3% | 94.9% | 94.6% | 28.4% |
| DR (Impr.) | 94.8% | 94.4% | 94.6% | 26.8% |
DR is the only estimator that performs well across all scenarios. TWFE should not be the default when covariates matter.
Two chances to get it right, and efficient when both models are correct.
Doubly robust is the default for DiD with covariates.
If covariates that are important for outcome changes in the absence of treatment are unbalanced across treated and comparison groups, this raises serious concerns about unconditional PT (Abadie 2005)
Intuition: if groups differ in \(X\), and \(X\) drives \(\Delta Y(\infty)\), then \(\mathbb{E}[\Delta Y(\infty) \mid D=1] \neq \mathbb{E}[\Delta Y(\infty) \mid D=0]\)
This motivates covariate balance diagnostics as part of any DiD analysis — following the broader principle that “design trumps analysis” (Rubin 2008; Baker et al. 2025)
Key diagnostics: standardized mean differences across treated and comparison groups, and propensity score overlap
Good balance \(\Rightarrow\) more credible results, less sensitivity to specification choices
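A minimal sketch of both diagnostics, assuming a one-row-per-unit data frame `df` with treatment indicator `treat` and covariates `x1`, `x2` (names hypothetical):

```r
# Standardized mean difference for one covariate
std_diff <- function(x, d) {
  (mean(x[d == 1]) - mean(x[d == 0])) /
    sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
}
std_diff(df$x1, df$treat)

# Propensity score overlap: compare estimated PS densities by group
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))
plot(density(phat[df$treat == 1]), main = "Propensity score overlap", xlab = "p(X)")
lines(density(phat[df$treat == 0]), lty = 2)
```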
Setting: Effect of Medicaid expansion on county-level mortality (Lecture 5 data)
Now incorporate county-level covariates:
Why covariates matter: Expansion states systematically differ from non-expansion states on these characteristics
Conditional PT more plausible than unconditional PT: “Among counties with similar demographics, mortality trends would be parallel absent expansion”
| Covariate | Treated | Comparison | Std. Diff. |
|---|---|---|---|
| % White | 79.5 | 77.9 | \(+0.115\) |
| % Hispanic | 18.9 | 17.0 | \(+0.107\) |
| % Female | 50.1 | 50.5 | \(-0.238\) |
| Unemp. Rate | 8.0 | 7.0 | \(+0.503\) |
| Poverty Rate | 15.3 | 17.2 | \(-0.375\) |
| Median Income ($K) | 57.9 | 49.3 | \(+0.685\) |
Wide confidence intervals — limited power with county-level data
None of the estimates are statistically significant
But the sensitivity of estimates to covariate inclusion is itself informative:
Good overlap and balance — the “design” checks out
Covariates matter for credibility even if they do not dramatically change point estimates
Medicaid had 6 covariates and good overlap. What happens with a richer covariate set?
Brazil’s psychiatric reform: 30 covariates + state FE.
Dias and Fontes (2024): Brazil’s 2002 Psychiatric Reform created CAPS (community mental health centers) replacing psychiatric hospitals
Staggered rollout across 5,180 municipalities (2002–2016)
Our \(2 \times 2\) setup: g = 2006 (early CAPS adopters) vs. never-treated; pre = 2005, post = 2007
Outcome: Assault homicide rate per 10,000 population
30 covariates: Demographics, income, transfers, poverty, geographic characteristics, health infrastructure (from 2000 census and administrative data) + state fixed effects
Surprising finding: CAPS adoption increases homicides — consistent with the Penrose hypothesis (see Dias and Fontes 2024) that deinstitutionalization reduces incapacitation
did/DRDID default: trim untreated with \(\hat{p}(X_i) > 0.995\) only; units with \(\hat{p}(X_i) \approx 0\) self-trim via vanishing weights
In both applications, covariates may not dramatically change point estimates
But they dramatically change credibility:
The “design phase” (balance diagnostics) is crucial for transparency
Bottom line: Even when estimates are stable, the exercise of checking matters
In these data, covariates predict treatment adoption but DR and unconditional estimates are broadly similar — the exercise of checking is what matters.
We have seen the theory, diagnostics, and empirical results.
How do we implement this in practice?
did Package (Primary)

```r
library(did)

# Callaway & Sant'Anna (2021) with covariates
# Uses doubly robust estimation by default
result <- att_gt(
  yname         = "l_homicide",
  tname         = "year",
  idname        = "sid",
  gname         = "first_treat",
  xformla       = ~ x1 + x2 + x3,
  data          = my_data,
  control_group = "notyettreated",
  est_method    = "dr",        # DR is the default
  base_period   = "universal"  # use first period as base
)

# Aggregate to event study
es <- aggte(result, type = "dynamic")
```

DRDID Package (Low-Level)

```r
library(DRDID)

# Panel data: Doubly Robust DiD (improved)
result_dr <- drdid(yname = "y", tname = "post",
                   idname = "id", dname = "treat",
                   xformla = ~ x1 + x2 + x3,
                   data = panel_data, panel = TRUE)

# Also available: ipwdid(), ordid()

# For low-level functions, note the intercept convention:
#   drdid_imp_panel: needs cbind(1, X) explicitly
#   twfe_did_panel:  adds intercept internally (do NOT add)
```

The did package wraps DRDID and handles the intercept convention automatically. Use DRDID for more control.
Step-by-step guide for DiD with covariates:
What if we observe repeated cross-sections instead of a panel?
New challenges: compositional changes and stationarity.
Panel data: Observe same units in both periods
Repeated cross-sections (RCS): Different units sampled each period
From Lecture 5: panel data is strictly more efficient than RCS
With covariates, the gap between panel and RCS has additional nuances
Assumption (RCS Sampling). The pooled RCS data \(\{Y_i, D_i, X_i, T_i\}_{i=1}^n\) are iid draws from: \[P(Y, D, X, T) = \lambda \cdot P(Y_2, D, X \mid T\!=\!2)\cdot\mathbf{1}\{T\!=\!2\} + (1\!-\!\lambda) \cdot P(Y_1, D, X \mid T\!=\!1)\cdot\mathbf{1}\{T\!=\!1\}\] where \(\lambda = \mathbb{P}(T = 2) \in (0,1)\).
Assumption (Stationarity / No Compositional Changes). The joint distribution of \((G_i, X_i)\) is the same across time periods: \[(G_i, X_i) \mid T_i = 1 \overset{d}{=} (G_i, X_i) \mid T_i = 2\]
In words: the “composition” of units sampled in each period is stable
Automatic in panel data (same units observed each period)
Not automatic in RCS: differential migration, attrition, survey redesigns can change who is sampled
Standard DR DiD estimators for RCS assume stationarity
When the covariate distribution shifts between periods (migration, attrition, survey redesign), standard RCS estimators are biased.
IPW for RCS (Abadie 2005): \[\text{ATT}^{ipw,rc} = \frac{1}{\mathbb{E}[D]}\;\mathbb{E}\!\left[\frac{D - p(X)}{1 - p(X)}\;\frac{T - \lambda}{\lambda(1-\lambda)}\; Y\right]\]
Compared to panel IPW, the RCS version uses an additional reweighting factor \(\frac{T - \lambda}{\lambda(1-\lambda)}\) that adjusts for the time dimension
Still requires the same overlap condition: \(p(X_i) < 1\) a.s.
The propensity score \(p(X) = \mathbb{P}(D = 1 \mid X)\) is estimated on pooled data across both periods — this is valid under the no-compositional-changes assumption
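A minimal sample-analogue sketch of this estimand, assuming a pooled RCS data frame `rcs` with outcome `y`, treatment-group indicator `treat`, period indicator `post` (\(=\mathbf{1}\{T=2\}\)), and covariates `x1`, `x2` (names hypothetical):

```r
# Propensity score estimated on pooled data (valid under stationarity)
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = rcs))

D   <- rcs$treat
T2  <- rcs$post
lam <- mean(T2)   # share of observations sampled in period 2

# Abadie (2005) repeated cross-section IPW ATT
att_ipw_rc <- mean(((D - phat) / (1 - phat)) *
                   ((T2 - lam) / (lam * (1 - lam))) * rcs$y) / mean(D)
att_ipw_rc
```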
The efficient DR estimand (Sant’Anna and Zhao 2020) models all four \((d,t)\) cells:
\[\begin{aligned} \text{ATT}^{dr,rc}_{eff} &= \underbrace{\mathbb{E}\Big[\tfrac{D}{\mathbb{E}[D]}\big(m_\Delta^{d=1}(X) - m_\Delta^{d=0}(X)\big)\Big]}_{\text{RA component}} \\[-0.2em] &\quad + \underbrace{\mathbb{E}\Big[w_{t=2}^{d=1}(Y - m_{t=2}^{d=1}(X)) - w_{t=1}^{d=1}(Y - m_{t=1}^{d=1}(X))\Big]}_{\text{treated bias correction}} \\[-0.2em] &\quad - \underbrace{\mathbb{E}\Big[w_{t=2}^{d=0}(Y - m_{t=2}^{d=0}(X)) - w_{t=1}^{d=0}(Y - m_{t=1}^{d=0}(X))\Big]}_{\text{comparison bias correction}} \end{aligned}\]
where \(m_\Delta^{d}(x) = m_{t=2}^{d}(x) - m_{t=1}^{d}(x)\). Weights defined on the next slide.
Treated weights (simple — just normalize within \((d\!=\!1, t)\) cells): \[w_{t}^{d=1}(D_i, T_i) = \frac{D_i \cdot \mathbf{1}\{T_i\!=\!t\}}{\mathbb{E}[D \cdot \mathbf{1}\{T\!=\!t\}]} \quad \text{for } t = 1, 2\]
Comparison weights (reweight to match treated covariate distribution): \[w_{t}^{d=0}(D_i, T_i, X_i) = \frac{\frac{p(X_i)(1-D_i)\,\mathbf{1}\{T_i\!=\!t\}}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X)(1-D)\,\mathbf{1}\{T\!=\!t\}}{1-p(X)}\right]} \quad \text{for } t = 1, 2\]
Compare with the panel case: RCS needs four weights (one per \((d,t)\) cell) instead of two.
Compositional changes: the distribution of \((G, X)\) differs across periods
Happens when:
When stationarity fails, standard RCS estimators are biased
The bias arises because the “comparison group trend” is contaminated by compositional shifts
Sant’Anna and Xu (2026) propose estimators that do not require stationarity
Key innovation: rate double robustness
Additional nuisance parameter: model for compositional changes
Nonparametric nuisance estimation: all nuisance functions (\(m_t^d\), \(p\), \(\pi\)) estimated nonparametrically — no need to assume linear/logistic models
DML-style procedures:
Leave-one-out estimation:
These innovations ensure \(\sqrt{n}\)-consistent and asymptotically normal estimators even when nuisance functions are estimated at slower-than-\(\sqrt{n}\) rates
Sant’Anna and Xu (2026) derive a DR estimand that does not require stationarity:
\[\begin{aligned} \tau_{dr}^{cc} &= \mathbb{E}\Big[w_{t=2}^{d=1,cc}\big(m_\Delta^{d=1}(X) - m_\Delta^{d=0}(X)\big)\Big] \\[-0.2em] &\quad + \mathbb{E}\Big[w_{t=2}^{d=1,cc}(Y - m_{t=2}^{d=1}(X)) - w_{t=1}^{d=1,cc}(Y - m_{t=1}^{d=1}(X))\Big] \\[-0.2em] &\quad - \mathbb{E}\Big[w_{t=2}^{d=0,cc}(Y - m_{t=2}^{d=0}(X)) - w_{t=1}^{d=0,cc}(Y - m_{t=1}^{d=0}(X))\Big] \end{aligned}\]
Key: weights use generalized PS \(\pi(d,t,x) = \mathbb{P}(D\!=\!d, T\!=\!t \mid X)\). Weights rebalance each \((d,t)\) cell to the treated-at-\(t\!=\!2\) distribution.
Intuition: \(\pi(1,2,X_i)/\pi(d,t,X_i)\) reweights each \((d,t)\) cell to match the treated post-treatment covariate distribution — the same IPW logic of “make comparison look like treated,” but now applied within each \((d,t)\) cell separately.
Key diagnostic: Compare estimators that use stationarity vs. those that do not
Under stationarity: both should give the same answer
Under compositional changes: they will diverge
Sant’Anna and Xu (2026) formalize this as a Hausman-type test: \[H_0: \text{stationarity holds} \quad \text{vs.} \quad H_1: \text{compositional changes}\]
Test statistic based on the difference between two DR estimators
Rejection \(\Rightarrow\) use the estimator that allows for compositional changes
Sant’Anna and Xu (2026) revisit Sequeira (2016)’s study of South Africa’s tariff liberalization on trade with Mozambique
Data: repeated cross-sections of trade flows across product categories
Compositional changes are plausible: product mix changes over time as trade patterns evolve
Results:
Panel data: stationarity holds by construction — not an issue
RCS data: always ask:
Practical advice:
Q: In your own research, how would you diagnose whether compositional changes are present?
What if we have many potential covariates?
From low to high dimensions: machine learning meets DiD.
We now enter high-dimensional territory: many covariates, flexible estimation
This section sketches the key ideas; a full treatment requires substantially more time than we have available
The practical takeaway:
Focus on the recipe and the intuition — we will not derive convergence rates or prove oracle inequalities
If you plan to use these methods in your dissertation, see Chernozhukov et al. (2018) for the full DML framework and Belloni, Chernozhukov, and Hansen (2014) for Post-LASSO inference theory.
So far: \(X\) is low-dimensional and we specify parametric models (OLS, logit)
But what if we have many potential confounders?
Standard approach: include “all reasonable” covariates \(\Rightarrow\) overfitting, instability
Can we do better?
Machine learning offers principled ways to handle high-dimensional \(X\):
But: naively plugging ML into DR creates an overfitting bias problem
Solution: cross-fitting — estimate nuisance functions on one sample, evaluate on another
This lecture: LASSO \(\rightarrow\) cross-fitting \(\rightarrow\) DML-DiD
With \(k\) covariates and \(n\) observations:
The bias-variance tradeoff:
We need methods that regularize: shrink or select to control complexity
Key insight: the DR structure provides a natural framework for ML integration
Recall the DR DiD estimand (panel data):
\[\text{ATT}^{dr} = \mathbb{E}\left[(w_1(D) - w_0(D, X; p))(\Delta Y - m_\Delta^{d=0}(X))\right]\]
Two nuisance functions to estimate:
The DR property means: bias from estimating these nuisance functions is a product of their respective errors
This is exactly the right structure for ML: we can use flexible methods for nuisance estimation while maintaining valid inference for the ATT
\[\hat{\beta}^{lasso} = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^k |\beta_j|\]
The \(\ell_1\) penalty \(\lambda \|\beta\|_1\) serves a dual purpose:
Useful when the true model is sparse: only \(s \ll k\) covariates truly matter
The tuning parameter \(\lambda\) controls the bias-variance tradeoff
Exact sparsity: Only \(s\) coefficients are nonzero (restrictive)
Approximate sparsity: Many small coefficients, but the best \(s\)-sparse approximation is close to the truth
More realistic: covariates may all contribute, but most only marginally
LASSO works well under approximate sparsity: it automatically finds the most important variables and provides a good approximation
Key rate requirement: \(s^2 \log(k) / n \to 0\) (sparsity grows slowly relative to \(n\))
LASSO shrinks coefficients toward zero \(\Rightarrow\) downward bias in fitted values
Post-LASSO (Belloni and Chernozhukov 2013; Belloni et al. 2017): a two-step procedure. First run LASSO to select variables, then refit unpenalized OLS on the selected set
Post-LASSO reduces bias while maintaining the sparsity-driven variable selection
Achieves the same rate of convergence as LASSO but with better constants
\[\hat{\gamma}^{lasso} = \arg\min_\gamma \;-\frac{1}{n}\sum_{i=1}^n \ell_i(\gamma) + \lambda \|\gamma\|_1, \quad \ell_i(\gamma) = D_i \log \Lambda(X_i'\gamma) + (1\!-\!D_i)\log(1 \!-\! \Lambda(X_i'\gamma))\]
\[\hat{\beta}^{lasso} = \arg\min_\beta \frac{1}{n_0}\sum_{i: D_i=0} (\Delta Y_i - X_i'\beta)^2 + \lambda \|\beta\|_1\]
Estimated using comparison group only (same as RA)
Post-LASSO OLS: Select variables, then refit OLS on selected set
LASSO automatically discovers which covariates predict \(\Delta Y\) among untreated
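A minimal sketch of both LASSO nuisance fits with `glmnet` (cross-validated \(\lambda\)), assuming a numeric covariate matrix `X`, treatment vector `D`, and outcome change `dy` (names hypothetical):

```r
library(glmnet)

# Propensity score: LASSO-penalized logit on all units
cv_ps <- cv.glmnet(X, D, family = "binomial", alpha = 1)
phat  <- predict(cv_ps, newx = X, s = "lambda.min", type = "response")[, 1]

# Outcome model: LASSO regression of dy on X, comparison units only
cv_or <- cv.glmnet(X[D == 0, ], dy[D == 0], alpha = 1)
m0hat <- predict(cv_or, newx = X, s = "lambda.min")[, 1]
```

Post-LASSO would refit an unpenalized logit/OLS on the selected columns; either set of fitted values is then plugged into the DR formula with cross-fitting, as sketched in the next subsection.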
Plugging ML-estimated nuisance functions into the DR formula can yield valid inference, but requires Donsker-type conditions on the nuisance function classes, plus stronger rate/smoothness/sparsity requirements
ML estimators (LASSO, random forests, neural nets) typically violate these conditions — their complexity grows with \(n\)
Without these conditions: using the same data to (i) fit nuisance functions and (ii) evaluate the DR formula creates regularization bias that may not vanish at \(\sqrt{n}\) rate
Intuition: ML adapts to noise in the training data, and this noise “leaks” into the DR estimator
Solution: Sample splitting / cross-fitting (Chernozhukov et al. 2018) — avoids Donsker conditions entirely
Double/Debiased Machine Learning (DML) (Chernozhukov et al. 2018):
Key property: Estimation and evaluation use different data \(\Rightarrow\) no overfitting bias
Allows \(\sqrt{n}\)-consistent and asymptotically normal ATT estimates even with slow-converging ML nuisance estimators (requires product of estimation errors \(\|\hat{m} - m_0\| \cdot \|\hat{p} - p_0\| = o_p(n^{-1/2})\))
Works with any ML method (LASSO, random forests, neural networks, …)
Input: \(\{Y_{i,t=1}, Y_{i,t=2}, D_i, X_i\}_{i=1}^n\), \(K\) folds
Inference: IF-based plug-in variance or multiplier bootstrap (Belloni et al. 2017; Chernozhukov et al. 2018)
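A minimal cross-fitting sketch under these assumptions (K folds, LASSO nuisances via `glmnet`, covariate matrix `X`, treatment `D`, change `dy`, all hypothetical); it uses the DR score with unnormalized weights for brevity and is not the authors' packaged implementation:

```r
library(glmnet)

set.seed(1)
K     <- 5
n     <- length(D)
folds <- sample(rep(1:K, length.out = n))
psi   <- numeric(n)

for (k in 1:K) {
  tr <- folds != k   # folds used to fit nuisance functions
  ev <- folds == k   # held-out fold used for evaluation

  ps_fit <- cv.glmnet(X[tr, ], D[tr], family = "binomial", alpha = 1)
  or_fit <- cv.glmnet(X[tr & D == 0, ], dy[tr & D == 0], alpha = 1)

  p_hat  <- predict(ps_fit, newx = X[ev, ], s = "lambda.min", type = "response")[, 1]
  m0_hat <- predict(or_fit, newx = X[ev, ], s = "lambda.min")[, 1]

  # DR score contributions on the held-out fold
  psi[ev] <- (D[ev] - (1 - D[ev]) * p_hat / (1 - p_hat)) * (dy[ev] - m0_hat)
}

# Cross-fitted DR ATT (packaged estimators use Hajek-normalized weights instead)
att_dml <- mean(psi) / mean(D)
att_dml
```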
We have the tools: LASSO + cross-fitting + DR.
How well does DML-DiD perform in finite samples?
Three DGPs with \(p = 100\) covariates (\(s = 5\) active), \(n = 500\):
Compare: LASSO-DR (with cross-fitting), Linear DR, TWFE
200 MC replications (ML is computationally intensive)
| | DGP 1 Bias | DGP 1 RMSE | DGP 2 Bias | DGP 2 RMSE | DGP 3 Bias | DGP 3 RMSE |
|---|---|---|---|---|---|---|
| | Uncond. PT | | Cond. PT, homog. | | Cond. PT, heterog. | |
| Unconditional DiD | \(-0.1\) | \(12.3\) | \(64.0\) | \(160.1\) | \(67.0\) | \(148.2\) |
| RA + LASSO | \(-0.1\) | \(12.3\) | \(2.9\) | \(14.9\) | \(-3.2\) | \(16.3\) |
| IPW + LASSO | \(0.4\) | \(14.2\) | \(-19.2\) | \(157.1\) | \(-16.2\) | \(155.4\) |
| DR + LASSO | \(0.4\) | \(14.0\) | \(0.9\) | \(14.7\) | \(-5.1\) | \(16.8\) |
| DR + Causal Forest | \(-0.2\) | \(12.5\) | \(44.1\) | \(76.0\) | \(41.1\) | \(70.9\) |
| | DGP 1 | DGP 2 | DGP 3 |
|---|---|---|---|
| | Uncond. PT | Cond. PT, homog. | Cond. PT, heterog. |
| Unconditional DiD | 96.5% | 91.0% | 94.5% |
| RA + LASSO | 96.5% | 92.5% | 93.0% |
| IPW + LASSO | 95.0% | 96.0% | 96.0% |
| DR + LASSO | 95.0% | 94.5% | 92.0% |
| DR + Causal Forest | 96.0% | 87.5% | 92.0% |
DR + LASSO with cross-fitting provides the best bias-variance tradeoff. Causal forests can estimate heterogeneous effects but are less reliable for average ATT.
DML-DiD estimates the average ATT with high-dimensional nuisance functions.
What if treatment effects vary across units?
Beyond estimating the average ATT: what about \(\text{CATT}(x)\)?
Generalized Random Forests (GRF) (Athey, Tibshirani, and Wager 2019):
Causal forests for DiD:
Combines the DR framework with forest-based heterogeneity estimation
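A minimal sketch with the `grf` package, using \(\Delta Y\) as the outcome so the forest targets \(\text{CATT}(x)\) under conditional PT (matrix `X`, treatment `D`, change `dy` as before, all hypothetical):

```r
library(grf)

# Causal forest on the outcome change: W is the treatment-group indicator
cf <- causal_forest(X = X, Y = dy, W = D)

# Unit-level CATT estimates (out-of-bag) and the implied ATT over treated units
catt_hat <- predict(cf)$predictions
average_treatment_effect(cf, target.sample = "treated")
```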
Suppose you care about treatment effects conditional on a small subset of covariates — say, gender, race, or income — but need many more covariates to justify conditional parallel trends. How should you proceed?
This connects to \(\text{CATT}(x)\) — but the “\(x\)” you care about is low-dimensional, while the “\(X\)” for identification is high-dimensional
Tension: averaging over nuisance covariates while conditioning on covariates of interest
We leave this as an open question for you to think about
ML can estimate nuisance functions and uncover heterogeneity.
When does all this machinery actually help?
ML is useful when:
ML is overkill when:
ML is a tool, not a magic bullet. Use it when the complexity of the nuisance functions justifies the additional machinery. For simple settings, parametric DR is simpler.
ML does not fix identification problems:
Computational cost: cross-fitting with LASSO/forests is slower than OLS
Interpretability: harder to understand what’s driving the estimates
Finite-sample performance: ML guarantees are asymptotic; with \(n = 200\), parametric methods often work better
Bottom line: ML extends the toolkit, but the hard work is still in the identification assumptions and study design
Three estimation strategies (RA, IPW, DR), RCS extensions, and ML integration.
What should practitioners take away from all this?
Key takeaways from DiD with covariates:
Conditional PT is often more credible than unconditional PT — but requires appropriate estimation methods
TWFE with covariates is fragile: it rules out covariate-specific trends, imposes hidden linearity restrictions, and can weight \(\text{ATT}(X)\) with potentially negative weights
Doubly robust is the default: consistent if either outcome model or propensity score is correct; efficient when both are
Panel > RCS: strictly more efficient, no stationarity concerns. For RCS, test for compositional changes
ML extends DR naturally: LASSO + cross-fitting for high-dimensional settings; causal forests for heterogeneity

ECON 730 | Causal Panel Data | Pedro H. C. Sant’Anna