Lecture 6: Incorporating Covariates into DiD
Emory University
Spring 2026
I
Why Covariates?
Applications, conditional PT, TWFE fragility
II
Three Strategies
RA, IPW, doubly robust
III
Design & Applications
Balance, Medicaid, Brazil CAPS
IV
Repeated Cross-Sections
Compositional changes
V
ML & DiD
LASSO, cross-fitting, causal forests
Building on Lecture 5 (\(2 \times 2\) DiD without covariates)
Lecture 5 established the \(2 \times 2\) DiD framework:
But is unconditional parallel trends realistic?
What if treated and comparison units differ systematically in pre-treatment characteristics?
Application 1: ACA Medicaid Expansion (Baker et al. 2025)
Application 2: Brazil Psychiatric Reform (Dias and Fontes 2024)
Both applications: treated and comparison groups differ in pre-treatment characteristics
Question: Can we still use unconditional PT?
Treated states (expanded Medicaid in 2014) vs. comparison states (never expanded)
Let’s look at pre-treatment characteristics:
Do these groups look comparable?
Sometimes the unconditional PT assumption is too strong
But PT may be plausible within subgroups defined by pre-treatment characteristics \(X_i\)
Intuition: “Among counties with similar demographics, treated and comparison counties would have trended similarly absent treatment”
This is the conditional parallel trends assumption
Key insight: Covariates can make the PT assumption more credible, but we need appropriate estimation methods to exploit them.
Two periods: \(t = 1\) (pre-treatment) and \(t = 2\) (post-treatment)
Two groups: \(G_i \in \{2, \infty\}\), with \(D_i = \mathbf{1}\{G_i = 2\}\)
Potential outcomes: \(Y_{i,t}(g)\) for each treatment timing \(g\)
Target parameter:
\[\text{ATT} = \mathbb{E}[Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \mid G_i = 2]\]
\(X_i\): vector of pre-treatment covariates (observed before treatment)
\(p(X_i) \equiv \mathbb{P}(D_i = 1 \mid X_i)\): generalized propensity score
\(p \equiv \mathbb{P}(D_i = 1) = \mathbb{E}[D_i]\): unconditional treatment probability
\(T_i \in \{1, 2\}\): period indicator for unit \(i\) (in panel data, each unit observed in both; in RCS, \(T_i\) is the sampling period)
Later we will also define the outcome-regression function \(m_\Delta^{d=0}(x)\), the weights \(w_1\) and \(w_0\), and, for repeated cross-sections, \(\lambda = \mathbb{P}(T = 2)\)
PT may hold only within covariate subgroups:
Assumption (Conditional PT). \(\mathbb{E}[\Delta Y_i(\infty) \mid X_i, D_i\!=\!1] = \mathbb{E}[\Delta Y_i(\infty) \mid X_i, D_i\!=\!0]\) a.s.
For identification, we also need treated units to have comparable controls at every covariate value:
Assumption (Strong Overlap). For some \(\epsilon > 0\), \(\mathbb{P}(D_i = 1 \mid X_i) < 1 - \epsilon\) almost surely.
Every treated unit must have comparison units with similar covariate values
Without overlap: we cannot learn about the counterfactual for some treated units
For identification: can relax to \(\epsilon = 0\) (boundary case)
For standard inference: need \(\epsilon > 0\) to avoid irregularity (Khan and Tamer 2010)
Closely related to overlap conditions in the matching/weighting literature (Crump et al. 2009)
Note: for ATT, we only need \(p(X_i)\) bounded away from 1, not from 0 — unlike ATE, which requires both bounds.
Solid = observed comparison trend. Dashed = PT counterfactual for the treated group. Within subgroups, these are parallel.
Assumptions:
A1. SUTVA
A2. No Anticipation: \(Y_{i,t=1}(2) = Y_{i,t=1}(\infty)\)
A3. Conditional PT
A4. Overlap
Step 1: Identify the conditional ATT: \[\text{CATT}(x) = \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 1] - \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 0]\] where \(\Delta Y_i \equiv Y_{i,t=2} - Y_{i,t=1}\).
Step 2: Integrate over the treated covariate distribution: \[\text{ATT} = \mathbb{E}[\text{CATT}(X_i) \mid D_i = 1]\]
The most common approach in applied work: “just add covariates to the regression” \[Y_{i,t} = \alpha_i + \lambda_t + \tau D_{it} + X_{i,t}'\beta + \varepsilon_{i,t}\]
Recall from Lecture 5: without covariates, TWFE \(=\) DiD-by-hand in \(2 \times 2\)
Many practitioners expect the same logic extends: “\(\hat{\tau}\) should estimate the ATT after controlling for \(X\)”
Adding \(X\) to TWFE is not the same as allowing for covariate-specific trends. The regression imposes strong — and often hidden — restrictions.
Consider the TWFE specification with pooled data: \[Y_{i,t} = \tilde{\alpha}_0 + \tilde{\gamma}_0 D_i + \tilde{\lambda}_0 \mathbf{1}\{T_i\!=\!2\} + \textcolor{red}{\tilde{\beta}_0^{twfe}} \big(D_i \cdot \mathbf{1}\{T_i\!=\!2\}\big) + X_i' \tilde{\alpha}_1 + \tilde{\varepsilon}_{i,t}\]
The implied conditional means:
\[\begin{aligned} \mathbb{E}[Y \mid D\!=\!0, T\!=\!1, X] &= \tilde{\alpha}_0 + X'\tilde{\alpha}_1, \quad \mathbb{E}[Y \mid D\!=\!0, T\!=\!2, X] = \tilde{\alpha}_0 + \tilde{\lambda}_0 + X'\tilde{\alpha}_1 \\ \mathbb{E}[Y \mid D\!=\!1, T\!=\!1, X] &= \tilde{\alpha}_0 + \tilde{\gamma}_0 + X'\tilde{\alpha}_1, \quad \mathbb{E}[Y \mid D\!=\!1, T\!=\!2, X] = \tilde{\alpha}_0 + \tilde{\gamma}_0 + \tilde{\lambda}_0 + \textcolor{red}{\tilde{\beta}_0^{twfe}} + X'\tilde{\alpha}_1 \end{aligned}\]
TWFE forces homogeneous trends — but how bad is the bias in practice?
A controlled simulation where ATT \(= 0\).
Data generating process from Sant’Anna and Zhao (2020):
TWFE regression: \(Y_{i,t} = \alpha + \gamma D_i + \lambda \mathbf{1}\{T_i\!=\!2\} + \tau \big(D_i \cdot \mathbf{1}\{T_i\!=\!2\}\big) + X_i'\beta + \varepsilon_{i,t}\)
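A minimal R sketch of this pooled regression, assuming a stacked data frame `df` with outcome `y`, treatment-group indicator `treat`, post-period indicator `post`, and covariates `x1`, `x2` (all names hypothetical):

```r
# Pooled TWFE-style regression with covariates in levels;
# the interaction coefficient is the estimate examined in the simulation
twfe_fit <- lm(y ~ treat * post + x1 + x2, data = df)
coef(twfe_fit)["treat:post"]
```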
Results (\(n = 1{,}000\), 1,000 MC replications):
The simulation used time-invariant \(X_i\) in a pooled regression. What if covariates vary over time and we use fixed effects?
Caetano & Callaway (2024): a formal decomposition.
Caetano and Callaway (2024) analyze the standard fixed effects specification: \[Y_{i,t} = \theta_t + \eta_i + \alpha D_{it} + X_{i,t}'\beta + e_{i,t}\]
Compare with the pooled specification on the previous slides, which controls for \(X_i\) in levels but forces homogeneous time trends.
When conditional PT holds but TWFE is used with covariates:
\[\tilde{\beta}_0^{twfe} = \underbrace{\mathbb{E}[w(\Delta X) \cdot \text{ATT}(X) \mid D\!=\!1]}_{\text{weighted ATT}} + \underbrace{\text{BIAS}_A}_{\substack{\text{time-}\\\text{invariant}}} + \underbrace{\text{BIAS}_B}_{\substack{\text{levels vs.}\\\text{changes}}} + \underbrace{\text{BIAS}_C}_{\text{nonlinearity}}\]
In Act II, we introduce three estimation strategies (RA, IPW, DR) that avoid all three biases by separating identification from estimation.
C&C showed TWFE with covariates introduces three sources of bias. How do these biases play out in real data?
Back to the Brazil application.
TWFE with covariates is fragile.
We need better tools: separate identification from estimation.
Idea: Model the comparison group’s outcome evolution \(\mathbb{E}[\Delta Y_i \mid X_i, D_i = 0]\), then impute for treated units
With panel data, the ATT simplifies to:
\[\text{ATT} = \mathbb{E}[\Delta Y_i \mid D_i = 1] - \mathbb{E}\left[m_\Delta^{d=0}(X_i) \mid D_i = 1\right] = \mathbb{E}\left[m_\Delta^{d=1}(X_i) - m_\Delta^{d=0}(X_i) \mid D_i = 1\right]\]
where \(m_\Delta^{d=0}(x) \equiv \mathbb{E}[\Delta Y_i \mid X_i = x, D_i = 0]\)
Key difference from TWFE: regression is estimated on the comparison group only, then predictions are made for treated units. This allows covariate-specific trends.
| Unit \(i\) | \(D_i\) | \(X_i\) (poverty %) | \(\Delta Y_i\) (mortality change) |
|---|---|---|---|
| 1 | 0 (comparison) | 12 | \(-2.0\) |
| 2 | 0 (comparison) | 18 | \(-0.5\) |
| 3 | 0 (comparison) | 22 | \(+1.0\) |
| 4 | 1 (treated) | 20 | \(-3.0\) |
| 5 | 1 (treated) | 15 | \(-2.5\) |
The key: comparison group regression tells us “what would have happened to treated units if they hadn’t been treated.”
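A minimal R sketch of the RA steps on the toy data above (regression on the comparison group only, then imputation for treated units):

```r
# Toy data from the table above
toy <- data.frame(
  d  = c(0, 0, 0, 1, 1),                # treatment group
  x  = c(12, 18, 22, 20, 15),           # poverty rate
  dy = c(-2.0, -0.5, 1.0, -3.0, -2.5)   # change in mortality
)

# Step 1: fit m_Delta^{d=0}(x) on comparison units only
m0 <- lm(dy ~ x, data = subset(toy, d == 0))

# Step 2: impute counterfactual changes for the treated units
treated <- subset(toy, d == 1)
cf <- predict(m0, newdata = treated)

# Step 3: ATT_hat^{ra} = average of observed minus imputed changes (about -2.3 here)
mean(treated$dy - cf)
```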
Consistent when the outcome model \(m_\Delta^{d=0}(x)\) is correctly specified
Inconsistent when \(m_\Delta^{d=0}(x)\) is misspecified
Works well when:
RA relies entirely on the researcher’s ability to model the comparison group’s outcome evolution.
Q: If RA is inconsistent under misspecification, why not always use a very flexible model for \(m_\Delta^{d=0}(x)\)?
Medicaid Expansion
Brazil CAPS Reform
The RA recipe (same in both applications):
\[\widehat{\text{ATT}}^{ra} = \frac{1}{n_1}\sum_{i:\,D_i=1}\bigg(\underbrace{\Delta Y_i}_{\text{observed}} - \underbrace{\hat{m}_\Delta^{d=0}(X_i)}_{\text{counterfactual}}\bigg)\]
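This recipe is packaged in `DRDID::ordid()` (outcome-regression DiD); a sketch assuming a balanced panel `panel_data` with hypothetical columns `y`, `post`, `id`, `treat`, and `x1`–`x3`:

```r
library(DRDID)

# Outcome-regression (RA) DiD: the outcome model is fit on the comparison group only
ra_out <- ordid(yname = "y", tname = "post", idname = "id", dname = "treat",
                xformla = ~ x1 + x2 + x3,
                data = panel_data, panel = TRUE)
summary(ra_out)
```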
RA models outcomes directly.
What if we instead reweight observations?
\[\text{ATT}^{ipw} = \frac{\mathbb{E}\left[\left(D_i - \frac{(1-D_i) p(X_i)}{1 - p(X_i)}\right) \Delta Y_i\right]}{\mathbb{E}[D_i]}\]
\[\text{ATT}^{ipw}_{std} = \mathbb{E}\bigg[\bigg(\underbrace{\frac{D_i}{\mathbb{E}[D_i]}}_{w_1(D_i)} - \underbrace{\frac{\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}\right]}}_{w_0(D_i, X_i)}\bigg) \Delta Y_i\bigg]\]
IPW uses \(w = p(X)/(1-p(X))\) to reweight comparison units so their covariate distribution matches the treated group.
Working model: Logistic \(p(X_i; \gamma_0) = \Lambda(X_i'\gamma_0)\)
Step 1: Estimate \(\gamma_0\) by logit MLE
Step 2: Plug in \(\hat{p}(X_i)\) and compute weighted averages
Influence function accounts for estimation error in \(\hat{\gamma}_n\)
Consistent when propensity score is correctly specified
Inconsistent when propensity score is misspecified — even if outcome model is known!
Overlap is critical: if \(p(X_i) \approx 1\), weights explode
Medicaid Expansion
Brazil CAPS Reform
The IPW recipe: (1) Estimate \(\hat{p}(X_i)\) using both treated and comparison units, (2) Reweight comparison units by \(\hat{p}(X_i)/(1 - \hat{p}(X_i))\), (3) Compute ATT as weighted difference:
\[\widehat{\text{ATT}}^{ipw} = \bar{\Delta Y}_{D=1} - \frac{\sum_{i:D_i=0} \hat{w}_i \cdot \Delta Y_i}{\sum_{i:D_i=0} \hat{w}_i}\]
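A minimal hand-rolled version of this recipe, assuming a one-row-per-unit data frame `df` with change `dy`, treatment `treat`, and covariates `x1`, `x2` (names hypothetical):

```r
# Step 1: propensity score from a logit on treated + comparison units
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))

# Step 2: odds weights p(X) / (1 - p(X)) for comparison units
w0 <- phat / (1 - phat)

# Step 3: Hajek (normalized) IPW ATT
att_ipw <- mean(df$dy[df$treat == 1]) -
  weighted.mean(df$dy[df$treat == 0], w0[df$treat == 0])
att_ipw
```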
| | RA | IPW |
|---|---|---|
| Models | Outcome evolution | Treatment assignment |
| Consistent when | \(m_\Delta^{d=0}(x)\) correct | \(p(x)\) correct |
| Fails when | Outcome misspecified | PS misspecified |
| Sensitive to | Functional form | Overlap violations |
RA and IPW have complementary failure modes. Can we combine them for robustness against either misspecification?
RA and IPW each rely on one model being correct.
Can we combine them for robustness against *either* type of misspecification?
Key idea: Combine outcome modeling (RA) with reweighting (IPW). Consistent if either the outcome model or the propensity score is correctly specified (but not necessarily both).
DR DiD Estimand (Sant’Anna and Zhao 2020): \(\displaystyle \text{ATT}^{dr} = \mathbb{E}\!\left[\Big(w_1(D_i) - w_0(D_i, X_i)\Big)\Big(\Delta Y_i - m_\Delta^{d=0}(X_i)\Big)\right]\)
Treated weight: \(w_1(D_i) = \frac{D_i}{\mathbb{E}[D_i]}\)
Comparison weight: \(w_0(D_i, X_i) = \frac{\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X_i)\,(1-D_i)}{1-p(X_i)}\right]}\)
The DR estimand has two equivalent decompositions:
\[\begin{aligned} \text{ATT}^{dr} &= \underbrace{\text{ATT}^{ipw}_{std}}_{\text{IPW}} - \underbrace{\mathbb{E}\!\left[\left(w_1(D_i) - w_0(D_i, X_i)\right) m_\Delta^{d=0}(X_i)\right]}_{\text{Outcome-based bias correction}} \\[0.2em] &= \underbrace{\text{ATT}^{ra}}_{\text{RA}} - \underbrace{\mathbb{E}\!\left[w_0(D_i, X_i)\left(\Delta Y_i - m_\Delta^{d=0}(X_i)\right)\right]}_{\text{Reweighting-based bias correction}} \end{aligned}\]
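A minimal sample-analogue sketch of the DR estimand, reusing the hypothetical data frame `df` (columns `dy`, `treat`, `x1`, `x2`) from the IPW slide:

```r
# Nuisance 1: propensity score (logit, all units)
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))

# Nuisance 2: comparison-group outcome evolution, predicted for all units
m0_fit <- lm(dy ~ x1 + x2, data = subset(df, treat == 0))
m0hat  <- predict(m0_fit, newdata = df)

# Weights w1 and w0 from the estimand, normalized in-sample
D  <- df$treat
w1 <- D / mean(D)
w0 <- (phat * (1 - D) / (1 - phat)) / mean(phat * (1 - D) / (1 - phat))

# Doubly robust ATT
att_dr <- mean((w1 - w0) * (df$dy - m0hat))
att_dr
```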
Medicaid Expansion
Brazil CAPS Reform
The DR recipe (same in both applications):
DR is consistent if either model is correct.
But what happens to precision when both are right?
Q: If DR gives us two chances, why care about getting both models right?
Sant’Anna and Zhao (2020) derive the semiparametric efficiency bound for ATT under conditional PT. The bound equals the variance of the efficient influence function:
\[\psi_i^{eff} = \big(w_1(D_i) - w_0(D_i, X_i; p_0)\big)\big(\Delta Y_i - m_{\Delta}^{d=0}(X_i)\big) - w_1(D_i)\cdot\text{ATT}\]
Sant’Anna and Zhao (2020) propose two versions:
Traditional DR (drdid_panel): standard logit PS + OLS outcome model
Improved DR (drdid_imp_panel): inverse probability tilting PS (Graham, Pinto, and Egel 2012) + weighted OLS outcome model
The improved version ensures the estimated PS satisfies an exact balancing condition, improving finite-sample performance
Both are doubly robust and locally efficient under correct specification
Bonus: The improved DR estimator is also doubly robust for inference — no need to adjust standard errors for first-step estimation of \(p(X_i)\) or \(m_\Delta^{d=0}(X_i)\)
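Both versions can be run with `DRDID::drdid()`; a sketch assuming its `estMethod` argument takes `"imp"` (improved, the default) and `"trad"` (traditional), with the same hypothetical panel columns as above:

```r
library(DRDID)

# Improved DR DiD: inverse probability tilting PS + weighted OLS outcome model
dr_imp <- drdid(yname = "y", tname = "post", idname = "id", dname = "treat",
                xformla = ~ x1 + x2 + x3, data = panel_data, panel = TRUE,
                estMethod = "imp")

# Traditional DR DiD: logit PS (MLE) + OLS outcome model
dr_trad <- drdid(yname = "y", tname = "post", idname = "id", dname = "treat",
                 xformla = ~ x1 + x2 + x3, data = panel_data, panel = TRUE,
                 estMethod = "trad")
```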
RA, IPW, and DR have different robustness and efficiency properties.
Do these properties hold in finite samples?
Four DGPs with true ATT \(= 0\), varying whether the propensity score and outcome regression models are correctly specified:
7 estimators: Oracle (infeasible), DR-Improved, DR-Traditional, IPW, IPW-Normalized, RA, TWFE
\(n = 500\), \(1{,}000\) MC replications
| | DGP 1 Bias | DGP 1 RMSE | DGP 2 Bias | DGP 2 RMSE | DGP 3 Bias | DGP 3 RMSE | DGP 4 Bias | DGP 4 RMSE |
|---|---|---|---|---|---|---|---|---|
| | Both correct | | PS wrong | | OR wrong | | Both wrong | |
| TWFE | \(-20.9\) | \(21.1\) | \(-20.5\) | \(20.6\) | \(-28.2\) | \(28.3\) | \(-16.4\) | \(16.5\) |
| Regression | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(-6.1\) | \(6.2\) | \(-5.2\) | \(5.3\) |
| IPW (Hajek) | \(0.0\) | \(1.2\) | \(-1.9\) | \(2.2\) | \(0.0\) | \(1.3\) | \(-4.0\) | \(4.2\) |
| DR (Trad.) | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(0.0\) | \(1.0\) | \(-3.2\) | \(3.5\) |
| DR (Impr.) | \(0.0\) | \(0.1\) | \(0.0\) | \(0.1\) | \(0.0\) | \(1.0\) | \(-1.0\) | \(2.6\) |
| | DGP 1 | DGP 2 | DGP 3 | DGP 4 |
|---|---|---|---|---|
| | Both correct | PS wrong | OR wrong | Both wrong |
| TWFE | 0.0% | 0.0% | 0.0% | 0.0% |
| Regression | 93.9% | 94.9% | 83.8% | 1.1% |
| IPW (Hajek) | 94.0% | 83.7% | 95.2% | 22.0% |
| DR (Trad.) | 95.3% | 94.9% | 94.6% | 28.4% |
| DR (Impr.) | 94.8% | 94.4% | 94.6% | 26.8% |
DR is the only estimator that performs well across all scenarios. TWFE should not be the default when covariates matter.
Two chances to get it right, and efficient when both models are correct.
Doubly robust is the default for DiD with covariates.
If covariates that are important for outcome changes in the absence of treatment are unbalanced across treated and comparison groups, this raises serious concerns about unconditional PT (Abadie 2005)
Intuition: if groups differ in \(X\), and \(X\) drives \(\Delta Y(\infty)\), then \(\mathbb{E}[\Delta Y(\infty) \mid D=1] \neq \mathbb{E}[\Delta Y(\infty) \mid D=0]\)
This motivates covariate balance diagnostics as part of any DiD analysis — following the broader principle that “design trumps analysis” (Rubin 2008; Baker et al. 2025)
Key diagnostics: standardized mean differences across treated and comparison groups, and propensity score overlap
Good balance \(\Rightarrow\) more credible results, less sensitivity to specification choices
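A minimal sketch of both diagnostics, assuming a one-row-per-unit data frame `df` with treatment indicator `treat` and covariates `x1`, `x2` (names hypothetical):

```r
# Standardized mean difference for one covariate
std_diff <- function(x, d) {
  (mean(x[d == 1]) - mean(x[d == 0])) /
    sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
}
std_diff(df$x1, df$treat)

# Propensity score overlap: compare estimated PS densities by group
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = df))
plot(density(phat[df$treat == 1]), main = "Propensity score overlap", xlab = "p(X)")
lines(density(phat[df$treat == 0]), lty = 2)
```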
Setting: Effect of Medicaid expansion on county-level mortality (Lecture 5 data)
Now incorporate county-level covariates:
Why covariates matter: Expansion states systematically differ from non-expansion states on these characteristics
Conditional PT more plausible than unconditional PT: “Among counties with similar demographics, mortality trends would be parallel absent expansion”
| Covariate | Treated | Comparison | Std. Diff. |
|---|---|---|---|
| % White | 79.5 | 77.9 | \(+0.115\) |
| % Hispanic | 18.9 | 17.0 | \(+0.107\) |
| % Female | 50.1 | 50.5 | \(-0.238\) |
| Unemp. Rate | 8.0 | 7.0 | \(+0.503\) |
| Poverty Rate | 15.3 | 17.2 | \(-0.375\) |
| Median Income ($K) | 57.9 | 49.3 | \(+0.685\) |
Wide confidence intervals — limited power with county-level data
None of the estimates are statistically significant
But the sensitivity of estimates to covariate inclusion is itself informative:
Good overlap and balance — the “design” checks out
Covariates matter for credibility even if they do not dramatically change point estimates
Medicaid had 6 covariates and good overlap. What happens with a richer covariate set?
Brazil’s psychiatric reform: 30 covariates + state FE.
Dias and Fontes (2024): Brazil’s 2002 Psychiatric Reform created CAPS (community mental health centers) replacing psychiatric hospitals
Staggered rollout across 5,180 municipalities (2002–2016)
Our \(2 \times 2\) setup: g = 2006 (early CAPS adopters) vs. never-treated; pre = 2005, post = 2007
Outcome: Assault homicide rate per 10,000 population
30 covariates: Demographics, income, transfers, poverty, geographic characteristics, health infrastructure (from 2000 census and administrative data) + state fixed effects
Surprising finding: CAPS adoption increases homicides — consistent with the Penrose hypothesis (see Dias and Fontes 2024) that deinstitutionalization reduces incapacitation
did/DRDID default: trim untreated with \(\hat{p}(X_i) > 0.995\) only; units with \(\hat{p}(X_i) \approx 0\) self-trim via vanishing weights
In both applications, covariates may not dramatically change point estimates
But they dramatically change credibility:
The “design phase” (balance diagnostics) is crucial for transparency
Bottom line: Even when estimates are stable, the exercise of checking matters
In these data, covariates predict treatment adoption but DR and unconditional estimates are broadly similar — the exercise of checking is what matters.
We have seen the theory, diagnostics, and empirical results.
How do we implement this in practice?
did Package (Primary)

```r
library(did)

# Callaway & Sant'Anna (2021) with covariates
# Uses doubly robust estimation by default
result <- att_gt(
  yname         = "l_homicide",
  tname         = "year",
  idname        = "sid",
  gname         = "first_treat",
  xformla       = ~ x1 + x2 + x3,
  data          = my_data,
  control_group = "notyettreated",
  est_method    = "dr",        # DR is the default
  base_period   = "universal"  # use first period as base
)

# Aggregate to event study
es <- aggte(result, type = "dynamic")
```

DRDID Package (Low-Level)

```r
library(DRDID)

# Panel data: Doubly Robust DiD (improved)
result_dr <- drdid(yname = "y", tname = "post",
                   idname = "id", dname = "treat",
                   xformla = ~ x1 + x2 + x3,
                   data = panel_data, panel = TRUE)

# Also available: ipwdid(), ordid()

# For low-level functions, note the intercept convention:
#   drdid_imp_panel: needs cbind(1, X) explicitly
#   twfe_did_panel:  adds intercept internally (do NOT add)
```

The did package wraps DRDID and handles the intercept convention automatically. Use DRDID for more control.
Step-by-step guide for DiD with covariates:
What if we observe repeated cross-sections instead of a panel?
New challenges: compositional changes and stationarity.
Panel data: Observe same units in both periods
Repeated cross-sections (RCS): Different units sampled each period
From Lecture 5: panel data is strictly more efficient than RCS
With covariates, the gap between panel and RCS has additional nuances
Assumption (RCS Sampling). The pooled RCS data \(\{Y_i, D_i, X_i, T_i\}_{i=1}^n\) are iid draws from: \[P(Y, D, X, T) = \lambda \cdot P(Y_2, D, X \mid T\!=\!2)\cdot\mathbf{1}\{T\!=\!2\} + (1\!-\!\lambda) \cdot P(Y_1, D, X \mid T\!=\!1)\cdot\mathbf{1}\{T\!=\!1\}\] where \(\lambda = \mathbb{P}(T = 2) \in (0,1)\).
Assumption (Stationarity / No Compositional Changes). The joint distribution of \((G_i, X_i)\) is the same across time periods: \[(G_i, X_i) \mid T_i = 1 \overset{d}{=} (G_i, X_i) \mid T_i = 2\]
In words: the “composition” of units sampled in each period is stable
Automatic in panel data (same units observed each period)
Not automatic in RCS: differential migration, attrition, survey redesigns can change who is sampled
Standard DR DiD estimators for RCS assume stationarity
When the covariate distribution shifts between periods (migration, attrition, survey redesign), standard RCS estimators are biased.
IPW for RCS (Abadie 2005): \[\text{ATT}^{ipw,rc} = \frac{1}{\mathbb{E}[D]}\;\mathbb{E}\!\left[\frac{D - p(X)}{1 - p(X)}\;\frac{T - \lambda}{\lambda(1-\lambda)}\; Y\right]\]
Compared to panel IPW, the RCS version uses an additional reweighting factor \(\frac{T - \lambda}{\lambda(1-\lambda)}\) that adjusts for the time dimension
Still requires the same overlap condition: \(p(X_i) < 1\) a.s.
The propensity score \(p(X) = \mathbb{P}(D = 1 \mid X)\) is estimated on pooled data across both periods — this is valid under the no-compositional-changes assumption
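A minimal sample-analogue sketch of this estimand, assuming a pooled RCS data frame `rcs` with outcome `y`, treatment-group indicator `treat`, period indicator `post` (\(=\mathbf{1}\{T=2\}\)), and covariates `x1`, `x2` (names hypothetical):

```r
# Propensity score estimated on pooled data (valid under stationarity)
phat <- fitted(glm(treat ~ x1 + x2, family = binomial(), data = rcs))

D   <- rcs$treat
T2  <- rcs$post
lam <- mean(T2)   # share of observations sampled in period 2

# Abadie (2005) repeated cross-section IPW ATT
att_ipw_rc <- mean(((D - phat) / (1 - phat)) *
                   ((T2 - lam) / (lam * (1 - lam))) * rcs$y) / mean(D)
att_ipw_rc
```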
The efficient DR estimand (Sant’Anna and Zhao 2020) models all four \((d,t)\) cells:
\[\begin{aligned} \text{ATT}^{dr,rc}_{eff} &= \underbrace{\mathbb{E}\Big[\tfrac{D}{\mathbb{E}[D]}\big(m_\Delta^{d=1}(X) - m_\Delta^{d=0}(X)\big)\Big]}_{\text{RA component}} \\[-0.2em] &\quad + \underbrace{\mathbb{E}\Big[w_{t=2}^{d=1}(Y - m_{t=2}^{d=1}(X)) - w_{t=1}^{d=1}(Y - m_{t=1}^{d=1}(X))\Big]}_{\text{treated bias correction}} \\[-0.2em] &\quad - \underbrace{\mathbb{E}\Big[w_{t=2}^{d=0}(Y - m_{t=2}^{d=0}(X)) - w_{t=1}^{d=0}(Y - m_{t=1}^{d=0}(X))\Big]}_{\text{comparison bias correction}} \end{aligned}\]
where \(m_\Delta^{d}(x) = m_{t=2}^{d}(x) - m_{t=1}^{d}(x)\). Weights defined on the next slide.
Treated weights (simple — just normalize within \((d\!=\!1, t)\) cells): \[w_{t}^{d=1}(D_i, T_i) = \frac{D_i \cdot \mathbf{1}\{T_i\!=\!t\}}{\mathbb{E}[D \cdot \mathbf{1}\{T\!=\!t\}]} \quad \text{for } t = 1, 2\]
Comparison weights (reweight to match treated covariate distribution): \[w_{t}^{d=0}(D_i, T_i, X_i) = \frac{\frac{p(X_i)(1-D_i)\,\mathbf{1}\{T_i\!=\!t\}}{1-p(X_i)}}{\mathbb{E}\!\left[\frac{p(X)(1-D)\,\mathbf{1}\{T\!=\!t\}}{1-p(X)}\right]} \quad \text{for } t = 1, 2\]
Compare with the panel case: RCS needs four weights (one per \((d,t)\) cell) instead of two.
Compositional changes: the distribution of \((G, X)\) differs across periods
Happens when:
When stationarity fails, standard RCS estimators are biased
The bias arises because the “comparison group trend” is contaminated by compositional shifts
Sant’Anna and Xu (2026) propose estimators that do not require stationarity
Key innovation: rate double robustness
Additional nuisance parameter: model for compositional changes
Nonparametric nuisance estimation: all nuisance functions (\(m_t^d\), \(p\), \(\pi\)) estimated nonparametrically — no need to assume linear/logistic models
DML-style procedures:
Leave-one-out estimation:
These innovations ensure \(\sqrt{n}\)-consistent and asymptotically normal estimators even when nuisance functions are estimated at slower-than-\(\sqrt{n}\) rates
Sant’Anna and Xu (2026) derive a DR estimand that does not require stationarity:
\[\begin{aligned} \tau_{dr}^{cc} &= \mathbb{E}\Big[w_{t=2}^{d=1,cc}\big(m_\Delta^{d=1}(X) - m_\Delta^{d=0}(X)\big)\Big] \\[-0.2em] &\quad + \mathbb{E}\Big[w_{t=2}^{d=1,cc}(Y - m_{t=2}^{d=1}(X)) - w_{t=1}^{d=1,cc}(Y - m_{t=1}^{d=1}(X))\Big] \\[-0.2em] &\quad - \mathbb{E}\Big[w_{t=2}^{d=0,cc}(Y - m_{t=2}^{d=0}(X)) - w_{t=1}^{d=0,cc}(Y - m_{t=1}^{d=0}(X))\Big] \end{aligned}\]
Key: weights use generalized PS \(\pi(d,t,x) = \mathbb{P}(D\!=\!d, T\!=\!t \mid X)\). Weights rebalance each \((d,t)\) cell to the treated-at-\(t\!=\!2\) distribution.
Intuition: \(\pi(1,2,X_i)/\pi(d,t,X_i)\) reweights each \((d,t)\) cell to match the treated post-treatment covariate distribution — the same IPW logic of “make comparison look like treated,” but now applied within each \((d,t)\) cell separately.
Key diagnostic: Compare estimators that use stationarity vs. those that do not
Under stationarity: both should give the same answer
Under compositional changes: they will diverge
Sant’Anna and Xu (2026) formalize this as a Hausman-type test: \[H_0: \text{stationarity holds} \quad \text{vs.} \quad H_1: \text{compositional changes}\]
Test statistic based on the difference between two DR estimators
Rejection \(\Rightarrow\) use the estimator that allows for compositional changes
Sant’Anna and Xu (2026) revisit Sequeira (2016)’s study of South Africa’s tariff liberalization on trade with Mozambique
Data: repeated cross-sections of trade flows across product categories
Compositional changes are plausible: product mix changes over time as trade patterns evolve
Results:
Panel data: stationarity holds by construction — not an issue
RCS data: always ask:
Practical advice:
Q: In your own research, how would you diagnose whether compositional changes are present?
What if we have many potential covariates?
From low to high dimensions: machine learning meets DiD.
We now enter high-dimensional territory: many covariates, flexible estimation
This section sketches the key ideas; a full treatment requires substantially more time than we have available
The practical takeaway:
Focus on the recipe and the intuition — we will not derive convergence rates or prove oracle inequalities
If you plan to use these methods in your dissertation, see Chernozhukov et al. (2018) for the full DML framework and Belloni, Chernozhukov, and Hansen (2014) for Post-LASSO inference theory.
So far: \(X\) is low-dimensional and we specify parametric models (OLS, logit)
But what if we have many potential confounders?
Standard approach: include “all reasonable” covariates \(\Rightarrow\) overfitting, instability
Can we do better?
Machine learning offers principled ways to handle high-dimensional \(X\):
But: naively plugging ML into DR creates an overfitting bias problem
Solution: cross-fitting — estimate nuisance functions on one sample, evaluate on another
This lecture: LASSO \(\rightarrow\) cross-fitting \(\rightarrow\) DML-DiD
With \(k\) covariates and \(n\) observations:
The bias-variance tradeoff:
We need methods that regularize: shrink or select to control complexity
Key insight: the DR structure provides a natural framework for ML integration
Recall the DR DiD estimand (panel data):
\[\text{ATT}^{dr} = \mathbb{E}\left[(w_1(D) - w_0(D, X; p))(\Delta Y - m_\Delta^{d=0}(X))\right]\]
Two nuisance functions to estimate:
The DR property means: bias from estimating these nuisance functions is a product of their respective errors
This is exactly the right structure for ML: we can use flexible methods for nuisance estimation while maintaining valid inference for the ATT
\[\hat{\beta}^{lasso} = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^k |\beta_j|\]
The \(\ell_1\) penalty \(\lambda \|\beta\|_1\) serves a dual purpose:
Useful when the true model is sparse: only \(s \ll k\) covariates truly matter
The tuning parameter \(\lambda\) controls the bias-variance tradeoff
Exact sparsity: Only \(s\) coefficients are nonzero (restrictive)
Approximate sparsity: Many small coefficients, but the best \(s\)-sparse approximation is close to the truth
More realistic: covariates may all contribute, but most only marginally
LASSO works well under approximate sparsity: it automatically finds the most important variables and provides a good approximation
Key rate requirement: \(s^2 \log(k) / n \to 0\) (sparsity grows slowly relative to \(n\))
LASSO shrinks coefficients toward zero \(\Rightarrow\) downward bias in fitted values
Post-LASSO (Belloni and Chernozhukov 2013; Belloni et al. 2017): a two-step procedure. First run LASSO to select variables, then refit unpenalized OLS on the selected set
Post-LASSO reduces bias while maintaining the sparsity-driven variable selection
Achieves the same rate of convergence as LASSO but with better constants
\[\hat{\gamma}^{lasso} = \arg\min_\gamma \;-\frac{1}{n}\sum_{i=1}^n \ell_i(\gamma) + \lambda \|\gamma\|_1, \quad \ell_i(\gamma) = D_i \log \Lambda(X_i'\gamma) + (1\!-\!D_i)\log(1 \!-\! \Lambda(X_i'\gamma))\]
\[\hat{\beta}^{lasso} = \arg\min_\beta \frac{1}{n_0}\sum_{i: D_i=0} (\Delta Y_i - X_i'\beta)^2 + \lambda \|\beta\|_1\]
Estimated using comparison group only (same as RA)
Post-LASSO OLS: Select variables, then refit OLS on selected set
LASSO automatically discovers which covariates predict \(\Delta Y\) among untreated
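A minimal sketch of both LASSO nuisance fits with `glmnet` (cross-validated \(\lambda\)), assuming a numeric covariate matrix `X`, treatment vector `D`, and outcome change `dy` (names hypothetical):

```r
library(glmnet)

# Propensity score: LASSO-penalized logit on all units
cv_ps <- cv.glmnet(X, D, family = "binomial", alpha = 1)
phat  <- predict(cv_ps, newx = X, s = "lambda.min", type = "response")[, 1]

# Outcome model: LASSO regression of dy on X, comparison units only
cv_or <- cv.glmnet(X[D == 0, ], dy[D == 0], alpha = 1)
m0hat <- predict(cv_or, newx = X, s = "lambda.min")[, 1]
```

Post-LASSO would refit an unpenalized logit/OLS on the selected columns; either set of fitted values is then plugged into the DR formula with cross-fitting, as sketched in the next subsection.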
Plugging ML-estimated nuisance functions into the DR formula can yield valid inference, but requires Donsker-type conditions on the nuisance function classes, plus stronger rate/smoothness/sparsity requirements
ML estimators (LASSO, random forests, neural nets) typically violate these conditions — their complexity grows with \(n\)
Without these conditions: using the same data to (i) fit nuisance functions and (ii) evaluate the DR formula creates regularization bias that may not vanish at \(\sqrt{n}\) rate
Intuition: ML adapts to noise in the training data, and this noise “leaks” into the DR estimator
Solution: Sample splitting / cross-fitting (Chernozhukov et al. 2018) — avoids Donsker conditions entirely
Double/Debiased Machine Learning (DML) (Chernozhukov et al. 2018):
Key property: Estimation and evaluation use different data \(\Rightarrow\) no overfitting bias
Allows \(\sqrt{n}\)-consistent and asymptotically normal ATT estimates even with slow-converging ML nuisance estimators (requires product of estimation errors \(\|\hat{m} - m_0\| \cdot \|\hat{p} - p_0\| = o_p(n^{-1/2})\))
Works with any ML method (LASSO, random forests, neural networks, …)
Input: \(\{Y_{i,t=1}, Y_{i,t=2}, D_i, X_i\}_{i=1}^n\), \(K\) folds
Inference: IF-based plug-in variance or multiplier bootstrap (Belloni et al. 2017; Chernozhukov et al. 2018)
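A minimal cross-fitting sketch under these assumptions (K folds, LASSO nuisances via `glmnet`, covariate matrix `X`, treatment `D`, change `dy`, all hypothetical); it uses the DR score with unnormalized weights for brevity and is not the authors' packaged implementation:

```r
library(glmnet)

set.seed(1)
K     <- 5
n     <- length(D)
folds <- sample(rep(1:K, length.out = n))
psi   <- numeric(n)

for (k in 1:K) {
  tr <- folds != k   # folds used to fit nuisance functions
  ev <- folds == k   # held-out fold used for evaluation

  ps_fit <- cv.glmnet(X[tr, ], D[tr], family = "binomial", alpha = 1)
  or_fit <- cv.glmnet(X[tr & D == 0, ], dy[tr & D == 0], alpha = 1)

  p_hat  <- predict(ps_fit, newx = X[ev, ], s = "lambda.min", type = "response")[, 1]
  m0_hat <- predict(or_fit, newx = X[ev, ], s = "lambda.min")[, 1]

  # DR score contributions on the held-out fold
  psi[ev] <- (D[ev] - (1 - D[ev]) * p_hat / (1 - p_hat)) * (dy[ev] - m0_hat)
}

# Cross-fitted DR ATT (packaged estimators use Hajek-normalized weights instead)
att_dml <- mean(psi) / mean(D)
att_dml
```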
We have the tools: LASSO + cross-fitting + DR.
How well does DML-DiD perform in finite samples?
Three DGPs with \(p = 100\) covariates (\(s = 5\) active), \(n = 500\):
Compare: LASSO-DR (with cross-fitting), Linear DR, TWFE
200 MC replications (ML is computationally intensive)
| | DGP 1 Bias | DGP 1 RMSE | DGP 2 Bias | DGP 2 RMSE | DGP 3 Bias | DGP 3 RMSE |
|---|---|---|---|---|---|---|
| | Uncond. PT | | Cond. PT, homog. | | Cond. PT, heterog. | |
| Unconditional DiD | \(-0.1\) | \(12.3\) | \(64.0\) | \(160.1\) | \(67.0\) | \(148.2\) |
| RA + LASSO | \(-0.1\) | \(12.3\) | \(2.9\) | \(14.9\) | \(-3.2\) | \(16.3\) |
| IPW + LASSO | \(0.4\) | \(14.2\) | \(-19.2\) | \(157.1\) | \(-16.2\) | \(155.4\) |
| DR + LASSO | \(0.4\) | \(14.0\) | \(0.9\) | \(14.7\) | \(-5.1\) | \(16.8\) |
| DR + Causal Forest | \(-0.2\) | \(12.5\) | \(44.1\) | \(76.0\) | \(41.1\) | \(70.9\) |
| | DGP 1 | DGP 2 | DGP 3 |
|---|---|---|---|
| | Uncond. PT | Cond. PT, homog. | Cond. PT, heterog. |
| Unconditional DiD | 96.5% | 91.0% | 94.5% |
| RA + LASSO | 96.5% | 92.5% | 93.0% |
| IPW + LASSO | 95.0% | 96.0% | 96.0% |
| DR + LASSO | 95.0% | 94.5% | 92.0% |
| DR + Causal Forest | 96.0% | 87.5% | 92.0% |
DR + LASSO with cross-fitting provides the best bias-variance tradeoff. Causal forests can estimate heterogeneous effects but are less reliable for average ATT.
DML-DiD estimates the average ATT with high-dimensional nuisance functions.
What if treatment effects vary across units?
Beyond estimating the average ATT: what about \(\text{CATT}(x)\)?
Generalized Random Forests (GRF) (Athey, Tibshirani, and Wager 2019):
Causal forests for DiD:
Combines the DR framework with forest-based heterogeneity estimation
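A minimal sketch with the `grf` package, using \(\Delta Y\) as the outcome so the forest targets \(\text{CATT}(x)\) under conditional PT (matrix `X`, treatment `D`, change `dy` as before, all hypothetical):

```r
library(grf)

# Causal forest on the outcome change: W is the treatment-group indicator
cf <- causal_forest(X = X, Y = dy, W = D)

# Unit-level CATT estimates (out-of-bag) and the implied ATT over treated units
catt_hat <- predict(cf)$predictions
average_treatment_effect(cf, target.sample = "treated")
```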
Suppose you care about treatment effects conditional on a small subset of covariates — say, gender, race, or income — but need many more covariates to justify conditional parallel trends. How should you proceed?
This connects to \(\text{CATT}(x)\) — but the “\(x\)” you care about is low-dimensional, while the “\(X\)” for identification is high-dimensional
Tension: averaging over nuisance covariates while conditioning on covariates of interest
We leave this as an open question for you to think about
ML can estimate nuisance functions and uncover heterogeneity.
When does all this machinery actually help?
ML is useful when:
ML is overkill when:
ML is a tool, not a magic bullet. Use it when the complexity of the nuisance functions justifies the additional machinery. For simple settings, parametric DR is simpler.
ML does not fix identification problems:
Computational cost: cross-fitting with LASSO/forests is slower than OLS
Interpretability: harder to understand what’s driving the estimates
Finite-sample performance: ML guarantees are asymptotic; with \(n = 200\), parametric methods often work better
Bottom line: ML extends the toolkit, but the hard work is still in the identification assumptions and study design
Three estimation strategies (RA, IPW, DR), RCS extensions, and ML integration.
What should practitioners take away from all this?
Key takeaways from DiD with covariates:
Conditional PT is often more credible than unconditional PT — but requires appropriate estimation methods
TWFE with covariates is fragile: it rules out covariate-specific trends, imposes hidden linearity restrictions, and can weight \(\text{ATT}(X)\) with potentially negative weights
Doubly robust is the default: consistent if either outcome model or propensity score is correct; efficient when both are
Panel > RCS: strictly more efficient, no stationarity concerns. For RCS, test for compositional changes
ML extends DR naturally: LASSO + cross-fitting for high-dimensional settings; causal forests for heterogeneity

ECON 730 | Causal Panel Data | Pedro H. C. Sant’Anna