Lecture 4: Efficient Estimation for Staggered Rollout Designs
Emory University
Spring 2026
Two natural questions: Is the sampling-based approach to inference adequate? How should we optimally use pre-treatment data when treatment timing is random?
Re-analysis by Wood et al. (2020) using full baseline adjustment (\(\beta = 1\)):
Officers averaged 0.044 complaints/month before training. Estimates suggest \(\approx 11\%\) reduction — but confidence intervals are wide.
Is this imprecision inherent to the data, or is there a more efficient way to use pre-treatment information?
How should we optimally use pre-treatment data when treatment timing is random?
By finding the efficient adjustment weight.
In Roth et al. (2023), we introduce a design-based framework and derive the efficient estimator for staggered rollout designs.
Key results:
Framework: Staggered rollout with random treatment timing
Special Case: Two-period model (building intuition)
Causal Parameters: Aggregations of \(ATE(g,t)\)
Class of Estimators: \(\hat{\theta}_\beta = \hat{\theta}_0 - \hat{X}'\beta\)
The Efficient Estimator: \(\beta^*\) and plug-in version
Inference: \(t\)-based CIs and Fisher Randomization Tests
R Implementation: The staggered package
Application: Police training revisited
The core question: In a randomized experiment, should you adjust for baseline covariates? How much? This is the classical regression adjustment debate (Freedman 2008; Lin 2013). Here, the “covariate” is the pre-treatment outcome.
Design-based framework: Potential outcomes \(Y_{i,t}(\cdot)\) and cohort sizes \(N_g = \sum_i \mathbf{1}[G_i = g]\) are fixed; only the treatment assignment \(G\) is stochastic.
Connection to Lecture 2: Same \(Y_{i,t}(g)\) notation. Connection to Lecture 3: Same design-based perspective, but now with absorbing staggered treatment.
Randomization Inference: potential outcomes are fixed; only treatment timing \(G\) is permuted
Actual Sample
| Unit | \(Y(2)\) | \(Y(3)\) | \(Y(4)\) | \(G_i\) |
|---|---|---|---|---|
| 1 | ? | ? | \(\checkmark\) | 4 |
| 2 | \(\checkmark\) | ? | ? | 2 |
| 3 | ? | ? | \(\checkmark\) | 4 |
| 4 | ? | \(\checkmark\) | ? | 3 |
| \(N\) | \(\checkmark\) | ? | ? | 2 |
Alternative I
| Unit | \(Y(2)\) | \(Y(3)\) | \(Y(4)\) | \(G_i\) |
|---|---|---|---|---|
| 1 | ? | \(\checkmark\) | ? | 3 |
| 2 | ? | \(\checkmark\) | ? | 3 |
| 3 | ? | ? | \(\checkmark\) | 4 |
| 4 | \(\checkmark\) | ? | ? | 2 |
| \(N\) | \(\checkmark\) | ? | ? | 2 |
Alternative II
| Unit | \(Y(2)\) | \(Y(3)\) | \(Y(4)\) | \(G_i\) |
|---|---|---|---|---|
| 1 | ? | ? | \(\checkmark\) | 4 |
| 2 | ? | \(\checkmark\) | ? | 3 |
| 3 | \(\checkmark\) | ? | ? | 2 |
| 4 | ? | ? | \(\checkmark\) | 4 |
| \(N\) | ? | ? | \(\checkmark\) | 3 |
Potential outcomes are the same across samples. Only \(G_i\) (and hence which POs we observe) changes.
Assumption (Random Treatment Timing). Let \(G = (G_1, \ldots, G_N)\). Then:
\[\Pr(G = \tilde{g}) = \frac{\prod_{g \in \mathcal{G}} N_g!}{N!} \quad \text{if } \sum_i \mathbf{1}[\tilde{g}_i = g] = N_g \text{ for all } g, \text{ and zero otherwise.}\]
Assumption (No Anticipation). For all units \(i\), periods \(t\), and treatment dates \(g, g' > t\):
\[Y_{i,t}(g) = Y_{i,t}(g')\]
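Under random treatment timing, an assignment is just a uniformly random ordering of the fixed cohort labels. A minimal base-R sketch (cohort sizes hypothetical, chosen to match the small tables below) that draws one such assignment and computes the stated probability:

```r
# Draw one assignment vector G consistent with fixed cohort sizes N_g:
# under random treatment timing, every admissible ordering is equally likely.
draw_G <- function(cohort_sizes) {
  labels <- rep(names(cohort_sizes), times = cohort_sizes)
  sample(labels)  # uniform over all orderings of the label vector
}

set.seed(1)
N_g <- c("2" = 2, "3" = 1, "Inf" = 1)  # N = 4 units (hypothetical)
G <- draw_G(N_g)
table(G)  # cohort counts match N_g by construction

# Probability of any single admissible assignment: prod(N_g!) / N!
p <- prod(factorial(N_g)) / factorial(sum(N_g))
p  # = 2! * 1! * 1! / 4! = 1/12
```

Permuting `G` while holding the potential outcomes fixed is exactly the thought experiment behind the randomization tables below.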
Bottom line: Random treatment timing provides additional structure beyond basic unconfoundedness. When plausible, it enables more efficient estimation. Balance checks help assess its credibility.
| Outcome: \(Y_i = Y_{i,t{=}2}\) | (post-treatment outcome) |
| Covariate: \(X_i = Y_{i,t{=}1} \equiv Y_{i,t{=}1}(\infty)\) | (pre-treatment outcome) |
| Treatment: \(D_i = \mathbf{1}[G_i = 2]\) | (treatment indicator) |
We use the two-period case as a running example to build intuition before moving to the general staggered case.
Target: \(\theta = \frac{1}{N} \sum_{i=1}^{N} [Y_{i,2}(2) - Y_{i,2}(\infty)]\) \(\qquad\) Class of estimators: \[\hat{\theta}_\beta = \underbrace{(\bar{Y}_{22} - \bar{Y}_{2\infty})}_{\text{Post-treatment diff}} - \beta \underbrace{(\bar{Y}_{12} - \bar{Y}_{1\infty})}_{\text{Pre-treatment diff}}\]
where \(\bar{Y}_{sg} = N_g^{-1} \sum_{i:\, G_i = g} Y_{i,s}\) is the period-\(s\) sample mean for cohort \(g\), and \(\beta \in \mathbb{R}\).
| \(\beta = 0\): | No adjustment (difference-in-means) | ignores pre-treatment data |
| \(\beta = 1\): | Full baseline adjustment | subtracts entire pre-treatment diff |
| \(\beta \in (0,1)\): | Partial adjustment | intermediate correction |
| \(\beta < 0\) or \(\beta > 1\): | Extrapolation | also allowed; optimal \(\beta^*\) may lie here |
Isomorphic to regression adjustment in experiments (Freedman 2008; Lin 2013). The pre-treatment outcome \(Y_{i,1}\) serves as the covariate.
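A base-R sketch of this estimator class on simulated two-period data (the DGP and all names are hypothetical); every \(\beta\) yields an unbiased estimate of the true effect, here 0.2, but the variances differ:

```r
set.seed(730)
N <- 500
Y1     <- rnorm(N)                    # pre-treatment outcome (period 1)
Y2_inf <- 0.7 * Y1 + rnorm(N)         # untreated potential outcome, period 2
Y2_2   <- Y2_inf + 0.2                # treated potential outcome: effect 0.2
D  <- sample(rep(c(1, 0), each = N / 2))  # random timing: G_i = 2 vs. Inf
Y2 <- ifelse(D == 1, Y2_2, Y2_inf)    # observed period-2 outcome

theta_hat <- function(beta) {
  (mean(Y2[D == 1]) - mean(Y2[D == 0])) -          # post-treatment diff
    beta * (mean(Y1[D == 1]) - mean(Y1[D == 0]))   # pre-treatment diff
}
sapply(c(0, 0.7, 1), theta_hat)  # similar point estimates; variance differs
```

Note that \(\hat{\theta}_\beta\) is linear in \(\beta\): the choice of \(\beta\) only rescales how much of the (mean-zero) pre-treatment difference is subtracted off.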
The average effect of switching treatment timing from \(g'\) to \(g\) at period \(t\):
\[\tau_{t,gg'} = \frac{1}{N} \sum_{i=1}^{N} \bigl[Y_{i,t}(g) - Y_{i,t}(g')\bigr]\]
Definition (General Scalar Estimand).
\[\theta = \sum_{t,g,g'} a_{t,gg'} \, \tau_{t,gg'}\]
where \(a_{t,gg'} \in \mathbb{R}\) are known weights with \(a_{t,gg'} = 0\) if \(t < \min(g, g')\).
Connection to Lecture 2: \(\tau_{t,g\infty} = ATE(g, t)\) is the group-time treatment effect. All aggregation parameters from Lecture 2 are special cases of \(\theta\).
Simple: \(\theta^{simple} = \frac{1}{\sum_t \sum_{g \leq t} N_g} \sum_t \sum_{g \leq t} N_g \cdot ATE(g,t)\)
Calendar-time: \(\theta^{calendar} = \frac{1}{T} \sum_t \theta_t\), where \(\theta_t = \frac{1}{\sum_{g \leq t} N_g} \sum_{g \leq t} N_g \cdot ATE(g,t)\)
Cohort: \(\theta^{cohort} = \frac{1}{\sum_{g \neq \infty} N_g} \sum_{g \neq \infty} N_g \, \theta_g\), where \(\theta_g = \frac{1}{T - g + 1} \sum_{t \geq g} ATE(g,t)\)
Event-study (lag \(l\)): \(\theta^{ES}_l = \frac{1}{\sum_{g: g+l \leq T} N_g} \sum_{g: g+l \leq T} N_g \cdot ATE(g, g{+}l)\)
These are the same aggregation parameters from Lecture 2 (Callaway and Sant’Anna 2021). Here we apply them to the design-based finite-population setting.
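All of these estimands are weighted averages of the \(ATE(g,t)\) cells. A toy base-R sketch with hypothetical cohort sizes and effects, computing the simple and cohort aggregations:

```r
# Toy setting: T = 3, cohorts g in {2, 3} plus a never-treated group.
# One row per post-treatment (g, t) cell with a hypothetical ATE(g, t).
cells <- data.frame(g   = c(2, 2, 3),
                    t   = c(2, 3, 3),
                    ate = c(0.10, 0.20, 0.05),
                    N_g = c(100, 100, 50))

# Simple: weight every (g, t) cell with t >= g by its cohort size N_g.
theta_simple <- with(cells, sum(N_g * ate) / sum(N_g))  # 0.13

# Cohort: average over t within each g first, then weight cohorts by N_g.
theta_g      <- aggregate(ate ~ g, data = cells, FUN = mean)$ate  # (0.15, 0.05)
sizes        <- c(100, 50)                               # N_2, N_3
theta_cohort <- sum(sizes * theta_g) / sum(sizes)        # 17.5/150 ~ 0.117
```

The two estimands differ because the simple average gives cohort 2 extra weight for being observed in two post-treatment periods, while the cohort average counts each cohort once.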
Definition (Estimator Class). Let \(\bar{Y}_{tg} = N_g^{-1} \sum_i \mathbf{1}[G_i = g] \, Y_{i,t}\) be the sample mean for cohort \(g\) at period \(t\), and \(\hat{\tau}_{t,gg'} = \bar{Y}_{tg} - \bar{Y}_{tg'}\).
Plug-in estimator: \(\hat{\theta}_0 = \sum_{t,g,g'} a_{t,gg'} \, \hat{\tau}_{t,gg'}\)
General class:
\[\hat{\theta}_\beta = \hat{\theta}_0 - \hat{X}'\beta\]
where \(\hat{X}\) is an \(M\)-vector of pre-treatment comparisons: \(\hat{X}_j = \sum_{(t,g,g'): g,g'>t} b^j_{t,gg'} \, \hat{\tau}_{t,gg'}\)
Under no anticipation, \(\mathbb{E}[\hat{X}] = 0\). The vector \(\hat{X}\) compares cohorts before either was treated — any pre-treatment difference is “noise” from randomization.
Callaway and Sant’Anna (2021): For estimating \(ATE(g,t)\) using never-treated: \[\hat{\tau}^{CS}_{t,g} = \underbrace{(\bar{Y}_{tg} - \bar{Y}_{t\infty})}_{\hat{\theta}_0:\text{ post-treatment diff}} - \underbrace{(\bar{Y}_{g-1,g} - \bar{Y}_{g-1,\infty})}_{\hat{X}:\text{ pre-treatment diff}}\]
This is \(\hat{\theta}_\beta\) with \(\beta = 1\): full baseline adjustment using period \(g{-}1\) as the pre-treatment reference.
Other estimators in the class: the unadjusted estimator (\(\beta = 0\)), Callaway and Sant'Anna (\(\beta = 1\)), and Sun and Abraham are all special cases with fixed adjustment weights.
Key observation: none of the existing approaches optimize over \(\beta\)!
Proposition (Unbiasedness — Lemma 2.1). Under Assumptions 1 (Random Treatment Timing) and 2 (No Anticipation):
\[\mathbb{E}[\hat{\theta}_\beta] = \theta \quad \text{for all } \beta \in \mathbb{R}^M\]
Proof sketch: Under random timing, each cohort is a simple random sample from the finite population, so \(\mathbb{E}[\hat{\tau}_{t,gg'}] = \tau_{t,gg'}\) and hence \(\mathbb{E}[\hat{\theta}_0] = \theta\). Under no anticipation, the pre-treatment comparisons satisfy \(\mathbb{E}[\hat{X}] = 0\). Therefore \(\mathbb{E}[\hat{\theta}_\beta] = \theta - 0'\beta = \theta\).
Since all \(\beta\) give unbiased estimators, we are free to choose \(\beta\) to minimize variance.
Proposition 2.1: The variance of \(\hat{\theta}_\beta\) is uniquely minimized at:
\[\beta^* = \text{Var}[\hat{X}]^{-1} \, \text{Cov}[\hat{X}, \hat{\theta}_0]\]
Proof: This is just OLS! We are minimizing \[\text{Var}[\hat{\theta}_0 - \hat{X}'\beta] = \text{Var}\bigl[(\hat{\theta}_0 - \theta) - (\hat{X} - 0)'\beta\bigr]\] over \(\beta\). The solution is the best linear predictor of \(\hat{\theta}_0\) given \(\hat{X}\).
Intuition: Adjust more for pre-treatment differences when they are more predictive of post-treatment differences. The optimal \(\beta^*\) interpolates between no adjustment (\(\beta = 0\)) and full adjustment (\(\beta = 1\)), and can even lie outside \([0,1]\).
Joint variance structure: \[\text{Var}\begin{pmatrix} \hat{\theta}_0 \\ \hat{X} \end{pmatrix} = \begin{pmatrix} V_{\theta_0} & V_{\theta_0, X} \\ V_{X, \theta_0} & V_X \end{pmatrix}\]
where the components involve the within-cohort covariance matrices \(S_g\) of the potential outcomes and the covariance matrix of unit-level treatment effects \(S_\theta\), which is not identified from the data.
Critical insight: \(\beta^* = V_X^{-1} V_{X, \theta_0}\) depends only on \(S_g\) (estimable!), not on \(S_\theta\). This is because \(V_X\) and \(V_{X, \theta_0}\) involve only within-cohort covariances.
We estimate \(S_g\) with the within-cohort sample covariance \(\hat{S}_g = (N_g - 1)^{-1} \sum_i \mathbf{1}[G_i = g] (\mathbf{Y}_i - \bar{\mathbf{Y}}_g)(\mathbf{Y}_i - \bar{\mathbf{Y}}_g)'\).
In the two-period case: \[\beta^* = \frac{N_\infty}{N} \beta_\infty + \frac{N_2}{N} \beta_2\]
where \(\beta_g\) is the regression coefficient from regressing \(Y_{i,2}(g)\) on \(Y_{i,1}\) (plus constant).
| Scenario | \(\beta^*\) | Optimal estimator |
|---|---|---|
| \(Y_{i,2}(g)\) uncorrelated with \(Y_{i,1}\) | \(\approx 0\) | No adjustment |
| High autocorrelation (\(\beta_g \approx 1\)) | \(\approx 1\) | Full adjustment |
| Intermediate autocorrelation | \(\in (0,1)\) | Partial adjustment |
| Mean reversion (\(\beta_g < 0\)) | \(< 0\) | Reverse adjustment |
| Low autocorrelation + heterogeneity | \(> 1\) | Over-adjustment |
Connection to experiments: This is exactly the Lin (2013) result for covariate adjustment in randomized experiments, with \(Y_{i,1}\) as the “covariate.”
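A base-R simulation sketch of the plug-in \(\hat{\beta}^*\) in the two-period case (hypothetical DGP): estimate the within-cohort slopes, weight them by cohort shares as in the formula above, and compare the spread of \(\hat{\theta}_\beta\) across \(\beta \in \{0, 1, \hat{\beta}^*\}\):

```r
set.seed(730)
one_draw <- function(N = 400, rho = 0.6, tau = 0.2) {
  Y1     <- rnorm(N)
  Y2_inf <- rho * Y1 + rnorm(N)             # autocorrelation rho
  D  <- sample(rep(c(1, 0), each = N / 2))  # random timing
  Y2 <- ifelse(D == 1, Y2_inf + tau, Y2_inf)
  # Plug-in beta*: cohort-share-weighted within-cohort regression slopes
  b_treat <- coef(lm(Y2 ~ Y1, subset = D == 1))[[2]]
  b_ctrl  <- coef(lm(Y2 ~ Y1, subset = D == 0))[[2]]
  beta_star <- mean(D) * b_treat + mean(1 - D) * b_ctrl
  th <- function(b) (mean(Y2[D == 1]) - mean(Y2[D == 0])) -
                      b * (mean(Y1[D == 1]) - mean(Y1[D == 0]))
  c(b0 = th(0), b1 = th(1), bstar = th(beta_star))
}
sims <- replicate(1000, one_draw())
round(apply(sims, 1, sd), 4)  # smallest spread at the plug-in beta*
```

With intermediate autocorrelation (\(\rho = 0.6\)), both \(\beta = 0\) and \(\beta = 1\) are noisier than the plug-in, matching the table above.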
Bridge: Lecture 3 established that HT-type estimators are unbiased under randomization. This lecture asks: among all unbiased estimators, which is most precise?
Regularity conditions: As \(N \to \infty\), cohort shares converge (\(N_g / N \to p_g \in (0,1)\)), variances \(S_g\) have positive definite limits, and a Lindeberg condition holds.
Proposition 2.2: \(\sqrt{N}(\hat{\theta}_{\hat{\beta}^*} - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma_*^2)\). No efficiency loss from estimating \(\beta^*\) — the plug-in achieves the same asymptotic variance as the oracle.
\(t\)-based CI: \(\hat{\theta}_{\hat{\beta}^*} \pm z_{1-\alpha/2} \cdot \hat{\sigma}_{**} / \sqrt{N}\), where \(\hat{\sigma}_{**}^2\) converges to an upper bound on the true variance.
What about Fisher Randomization Tests?
Under random treatment timing, we can construct permutation tests:
Proposition 2.3: The studentized FRT is finite-sample exact under the sharp null and asymptotically valid under the weak null.
Studentization is essential! Without it, FRTs may not have correct size for the weak null (Wu and Ding 2021; Zhao and Ding 2021).
Simulation with synthetic data.
| Sharp null (\(H_0^S\)): \(Y_i(g) = Y_i(g')\) for all \(i, g, g'\) | \(\to\) FRT is exact |
| Weak null (\(H_0^W\)): \(\theta = 0\) | \(\to\) FRT may over-reject! |
Under heterogeneous effects, unstudentized FRTs have severe size distortions. Studentization restores correct size (Wu and Ding 2021; Zhao and Ding 2021).
Practical implication: Even if you use CS estimators (under weaker assumptions discussed in later lectures), you can supplement with FRT-based \(p\)-values when random timing is plausible. The staggered package implements this.
Installation: install.packages("staggered")
Main functions:
staggered(): Efficient estimator (Roth et al. 2023)
staggered_cs(): Callaway and Sant’Anna (2021) estimator
staggered_sa(): Sun and Abraham (2021) estimator
Built-in dataset: pj_officer_level_balanced (police training application)
Output: Returns estimate, se, se_neyman
Additional output: fisher_pval (permutation \(p\)-value)
# Calendar-time weighted average
staggered(df = pj_officer_level_balanced,
i = "uid", t = "period",
g = "first_trained", y = "complaints",
estimand = "calendar")
# Cohort-weighted average
staggered(..., estimand = "cohort")
# Event-study: effects at lags 0 through 11
staggered(df = pj_officer_level_balanced,
i = "uid", t = "period",
g = "first_trained", y = "complaints",
estimand = "eventstudy",
          eventTime = 0:11)
Event-study returns one row per event time, each with its own estimate and SE.
# Efficient estimator (Roth & Sant'Anna)
res_eff <- staggered(df = pj_officer_level_balanced,
i = "uid", t = "period", g = "first_trained",
y = "complaints", estimand = "simple")
# Callaway & Sant'Anna (2021)
res_cs <- staggered_cs(df = pj_officer_level_balanced,
i = "uid", t = "period", g = "first_trained",
y = "complaints", estimand = "simple")
# Sun & Abraham (2021)
res_sa <- staggered_sa(df = pj_officer_level_balanced,
i = "uid", t = "period", g = "first_trained",
y = "complaints", estimand = "simple")
# Unadjusted (no pre-treatment adjustment)
res_unadj <- staggered(df = pj_officer_level_balanced,
i = "uid", t = "period", g = "first_trained",
y = "complaints", estimand = "simple",
                        beta = 1e-16, use_DiD_A0 = TRUE)
Comparison: Efficient estimator vs. Callaway-Sant’Anna vs. Sun-Abraham
SE ratios annotated above each CI. Large gains over CS and SA; modest gains over unadjusted (\(\beta=0\)) — efficiency gains vary by application.
SE ratios \(>1\) indicate the efficient estimator has smaller standard errors.
Efficiency gains vary across outcomes. Largest gains for sustained complaints; modest for use of force. SE ratios annotated above each CI.
Efficient estimator (blue) produces tighter CIs at every event time, enabling sharper inference about dynamic treatment effects.
Can we assess the validity of the assumptions?
Results from police training data:
Balance checks provide evidence for or against the random timing assumption — analogous to covariate balance checks in RCTs.
When treatment timing is random, estimators that use fixed adjustment weights (\(\beta = 0\) or \(\beta = 1\)) “leave money on the table”
Roth et al. (2023) show how to use additional information to “collect the money”
Estimators and inference easily implemented in R via the staggered package
Recommendation: Use when treatment timing is (quasi-)random
Next lecture: Observational panel data — when treatment timing is not random.
| Result | Statement |
|---|---|
| Unbiasedness | \(\hat{\theta}_\beta\) is unbiased for \(\theta\) for any \(\beta\) (Lemma 2.1) |
| Efficiency | \(\beta^* = \text{Var}[\hat{X}]^{-1}\text{Cov}[\hat{X}, \hat{\theta}_0]\) minimizes variance (Prop. 2.1) |
| Feasibility | Plug-in \(\hat{\beta}^*\) achieves same asymptotic variance as oracle (Prop. 2.2) |
| Inference | Studentized FRT: exact under sharp null, valid under weak null (Prop. 2.3) |
| Nesting | \(\beta{=}0\), \(\beta{=}1\), CS, SA all special cases with fixed \(\beta\) |
| Gains | SE reductions of 1.4–8.4\(\times\) in application |
DGP: Draw \(\mathbf{Y}_i(\infty) \sim \mathcal{N}(0, \Sigma_\rho)\); set \(Y_{i,2}(2) = Y_{i,2}(\infty) + \gamma(Y_{i,2}(\infty) - \mathbb{E}_{fin}[Y_{i,2}(\infty)])\)
Comparison: Plug-in efficient vs. full adjustment (\(\beta = 1\)) vs. no adjustment (\(\beta = 0\))
Key findings:
Setup: Calibrated to police training data (72 periods, 48 cohorts, 7,785 officers). Sharp null: \(Y_{it}(g)\) = observed outcome for all \(g\).
Comparison: Plug-in efficient vs. Callaway-Sant’Anna vs. Sun-Abraham
Key findings:
Substantial efficiency gains are achievable in realistic staggered designs when treatment timing is random.
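The DGP described above can be sketched in base R for the two-period case (\(\Sigma_\rho\) is taken here as a 2×2 correlation matrix, and the values of \(\gamma\) and \(\rho\) are arbitrary; both choices are assumptions):

```r
set.seed(730)
N <- 1000; rho <- 0.5; gamma <- 0.5
# Sigma_rho: correlation matrix for (Y_{i,1}(Inf), Y_{i,2}(Inf))
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
Y_inf <- matrix(rnorm(N * 2), N, 2) %*% chol(Sigma)  # untreated POs
# Heterogeneous effects centered at the finite-population mean:
tau_i <- gamma * (Y_inf[, 2] - mean(Y_inf[, 2]))
Y2_treated <- Y_inf[, 2] + tau_i
mean(tau_i)  # 0 up to floating-point error: the weak null holds exactly
```

Because the effects are centered at the finite-population mean, the weak null \(\theta = 0\) holds by construction while individual effects are heterogeneous, which is exactly the regime where studentization matters.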
Full Simulation Code: FRT Studentization
This simulation demonstrates why studentization is essential for FRTs under the weak null. Key DGP: heterogeneous treatment effects with unequal group sizes create variance asymmetry that the unstudentized FRT fails to account for.
set.seed(730)
N <- 200; n_treat <- 40; n_ctrl <- 160
n_perms <- 999; n_sims <- 2000; alpha <- 0.05
sigma_tau <- 3 # tau_i ~ N(0, sigma_tau^2), so E[tau]=0 (weak null)
# ---- Permutation Histogram (single simulation) ----
Y0 <- rnorm(N, 0, 1)
tau_i <- rnorm(N, 0, sigma_tau)
D <- c(rep(1, n_treat), rep(0, n_ctrl))
Y <- Y0 + D * tau_i
t_obs <- mean(Y[D == 1]) - mean(Y[D == 0])
t_perm <- replicate(n_perms, {
  D_p <- sample(D); mean(Y[D_p == 1]) - mean(Y[D_p == 0])
})
hist(t_perm, breaks = 40, main = "Permutation distribution")
abline(v = t_obs, lwd = 2)  # observed statistic
# ---- Rejection Rate Comparison (2000 simulations) ----
reject_unstud <- reject_stud <- 0
for (sim in 1:n_sims) {
Y0 <- rnorm(N); tau_i <- rnorm(N, 0, sigma_tau)
D <- c(rep(1, n_treat), rep(0, n_ctrl))
Y <- Y0 + D * tau_i
t_u <- mean(Y[D==1]) - mean(Y[D==0])
se <- sqrt(var(Y[D==1])/n_treat + var(Y[D==0])/n_ctrl)
t_s <- t_u / se
t_perm_u <- t_perm_s <- numeric(n_perms)
for (p in 1:n_perms) {
Dp <- sample(D)
Yt <- Y[Dp==1]; Yc <- Y[Dp==0]
d <- mean(Yt) - mean(Yc)
t_perm_u[p] <- d
t_perm_s[p] <- d / sqrt(var(Yt)/sum(Dp) + var(Yc)/sum(1-Dp))
}
if (mean(abs(t_perm_u) >= abs(t_u)) < alpha) reject_unstud <- reject_unstud + 1
if (mean(abs(t_perm_s) >= abs(t_s)) < alpha) reject_stud <- reject_stud + 1
}
cat("Unstudentized:", reject_unstud/n_sims,
    "| Studentized:", reject_stud/n_sims)
Application Figure Generation Code
Figures are generated from the pj_officer_level_balanced dataset in the staggered R package.
library(staggered)
data(pj_officer_level_balanced)
df <- pj_officer_level_balanced
# ---- Efficient vs. CS vs. SA vs. Unadjusted ----
res_eff <- staggered(df = df, i = "uid", t = "period",
g = "first_trained", y = "complaints", estimand = "simple")
res_cs <- staggered_cs(df = df, i = "uid", t = "period",
g = "first_trained", y = "complaints", estimand = "simple")
res_sa <- staggered_sa(df = df, i = "uid", t = "period",
g = "first_trained", y = "complaints", estimand = "simple")
# ---- SE Ratios Across Estimands ----
for (est in c("simple", "calendar", "cohort")) {
r_eff <- staggered(df=df, i="uid", t="period",
g="first_trained", y="complaints", estimand=est)
r_cs <- staggered_cs(df=df, i="uid", t="period",
g="first_trained", y="complaints", estimand=est)
cat(est, ": CS/Eff SE ratio =", r_cs$se / r_eff$se, "\n")
}
# ---- Event Study ----
es_eff <- staggered(df=df, i="uid", t="period",
g="first_trained", y="complaints",
estimand="eventstudy", eventTime=0:11)
es_cs <- staggered_cs(df=df, i="uid", t="period",
g="first_trained", y="complaints",
estimand="eventstudy", eventTime=0:11)
# See Figures/Lecture4/generate_figures.R for full code
ECON 730 | Causal Panel Data | Pedro H. C. Sant’Anna