Lecture 2: Potential Outcomes and Causal Parameters
Emory University
Spring 2026
“If a person eats of a particular dish, and dies in consequence, that is, would not have died if he had not eaten it, people would be apt to say that eating of that dish was the source of his death.” – John Stuart Mill (19th-century moral philosopher and economist)
“Causation is something that makes a difference, and the difference it makes must be a difference from what would have happened without it.” – David Lewis (20th-century philosopher)
Mill’s counterfactual reasoning was immensely valuable for providing a clear and intuitive definition of causality.
But it also made it clear that causality is a tricky business!
If learning the causal effect requires knowing what would have happened had I not eaten the dish, but I did in fact eat it, how could I ever know the causal effect of eating the dish?
The same reasoning applies to all the what-if motivating questions we discussed!
This is a valid concern, but this should not stop us from trying to answer these questions.

Source: xkcd.com/552
Although this is not a universal point of view, we will adopt the approach popularized by Holland (1986): “No causation without manipulation”.
“Causes are only those things that could, in principle, be treatments in experiment” (Holland 1986).
“Causes are experiences that units undergo and not attributes that they possess” (Holland 2003).
This restricts the problems we work with, or at least forces us to think about the problem from this angle.
Causal inference is not a prediction problem but rather a counterfactual problem.
This makes things challenging because:
Still, some modifications and tricks can be used to bypass several of these challenges.
Key: decompose the problem into predictive and causal parts.
Specify the causal question of interest and map that into a causal target parameter.
Figure out a research design, using domain knowledge, that can credibly answer the question (this usually involves solving identification problems).
State our assumptions, and provide supportive evidence of their credibility in our context.
Pick an estimation and inference method with strong statistical guarantees.
What is the average treatment effect of expanding Medicaid in 2014 on mortality rates, compared to not expanding it?
What is the average treatment effect of a minimum wage increase in 2004 on employment among states that indeed raised minimum wage in 2004?
What is the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation?
We will adopt the Rubin Causal Model and define potential outcomes.
There are other approaches/languages out there, too, e.g., Judea Pearl’s Directed Acyclic Graphs (DAGs). These should be seen as complements.
Potential outcomes define outcomes in different states of the world, depending on the treatment assigned to each unit.
Let \(D\) be a treatment variable.
Let \(Y_{i}(d)\) be the potential outcome for unit \(i\) if they were assigned treatment \(d\).
Each unit \(i\) has many different potential outcomes, one for each possible treatment value.
What is the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation?
Treatment \(\mathbf{D}\): \(D_i=1\) if worker \(i\) is eligible for a 401(k); \(D_i=0\) if worker \(i\) works at a firm that does not offer a 401(k).
Potential Outcomes \(\mathbf{Y_i(1)}\), \(\mathbf{Y_i(0)}\): \(Y_i(1)\) is asset accumulation for worker \(i\) if eligible for a 401(k); \(Y_i(0)\) is asset accumulation for worker \(i\) if not eligible.
Unit-specific Treatment Effect
The treatment effect or causal effect of switching treatment from \(d'\) to \(d\) is the difference between these two potential outcomes:
\[ Y_{i}(d) - Y_{i}(d')\]
When treatment is binary, \[ Y_{i}(1) - Y_{i}(0)\]
Fundamental problem of causal inference (Holland 1986): we can never observe both \(Y_{i}(1)\) and \(Y_{i}(0)\) for the same unit.
Observed outcome with binary treatments
Observed outcomes for unit \(i\) are realized as \[Y_{i} = 1\{D_i = 1\}Y_{i}(1) + 1\{D_i = 0 \}Y_{i}(0)\]
\[Y_i = \begin{cases} Y_i(1)\text{ if }D_i=1 \\ Y_i(0)\text{ if }D_i=0 \end{cases}\]
| Unit | \(Y_{i}(1)\) | \(Y_{i}(0)\) | \(D_i\) | \(Y_{i}(1)-Y_{i}(0)\) | \(X_i\) |
|---|---|---|---|---|---|
| 1 | ? | ✓ | 0 | ? | \(x_1\) |
| 2 | ✓ | ? | 1 | ? | \(x_2\) |
| 3 | ? | ✓ | 0 | ? | \(x_3\) |
| 4 | ✓ | ? | 1 | ? | \(x_4\) |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| n | ✓ | ? | 1 | ? | \(x_n\) |
✓: Observed data ?: Missing data (unobserved counterfactuals)
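To make the missing-data structure concrete, here is a minimal simulation sketch (not part of the original example; all numbers are made up) that generates both potential outcomes, applies the switching equation, and shows that one column is always missing:

```python
# Minimal sketch: simulate potential outcomes for a binary treatment and show
# that only one of Y_i(1), Y_i(0) is ever observed (all numbers are hypothetical).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 6                                    # tiny sample, just for display
y0 = rng.normal(20, 5, n)                # hypothetical Y_i(0)
y1 = y0 + rng.normal(10, 3, n)           # hypothetical Y_i(1)
d = rng.integers(0, 2, n)                # observed treatment D_i

# Switching equation: Y_i = D_i * Y_i(1) + (1 - D_i) * Y_i(0)
y = d * y1 + (1 - d) * y0

df = pd.DataFrame({
    "Y(1)": np.where(d == 1, y1.round(1), np.nan),   # "?" whenever D_i = 0
    "Y(0)": np.where(d == 0, y0.round(1), np.nan),   # "?" whenever D_i = 1
    "D": d,
    "Y_observed": y.round(1),
})
print(df)  # the unit-level effect Y_i(1) - Y_i(0) can never be computed from df
```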
Treatment can take multiple values: \(D_i \in \{0, 1, 2, 3\}\)
Potential outcomes for each treatment level: \(Y_i(0)\), \(Y_i(1)\), \(Y_i(2)\), \(Y_i(3)\).
Observed outcome: \[Y_i = \sum_{d=0}^{3} \mathbf{1}\{D_i = d\} \cdot Y_i(d)\]
We observe exactly one of the four potential outcomes
| Unit | \(Y_{i}(0)\) | \(Y_{i}(1)\) | \(Y_{i}(2)\) | \(Y_{i}(3)\) | \(D_i\) | \(X_i\) |
|---|---|---|---|---|---|---|
| 1 | ✓ | ? | ? | ? | 0 | \(x_1\) |
| 2 | ? | ? | ✓ | ? | 2 | \(x_2\) |
| 3 | ? | ✓ | ? | ? | 1 | \(x_3\) |
| 4 | ? | ? | ? | ✓ | 3 | \(x_4\) |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| n | ? | ? | ✓ | ? | 2 | \(x_n\) |
Key insight: With 4 treatment levels, we observe 1 potential outcome and miss 3 for each unit. The missing data problem is even worse!
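A small sketch of the multi-valued switching equation, using simulated outcomes for a hypothetical 4-level treatment:

```python
# Sketch of Y_i = sum_d 1{D_i = d} * Y_i(d) with a simulated 4-level treatment.
import numpy as np

rng = np.random.default_rng(1)
n, levels = 5, 4
y_pot = rng.normal(size=(n, levels))      # columns are Y_i(0), ..., Y_i(3)
d = rng.integers(0, levels, n)            # realized treatment level

y_obs = y_pot[np.arange(n), d]            # keep the one realized column per unit
# Equivalent computation mirroring the formula directly:
y_obs_check = sum((d == k) * y_pot[:, k] for k in range(levels))
assert np.allclose(y_obs, y_obs_check)
print(np.c_[d, y_obs.round(2)])           # one observed outcome per unit; 3 are missing
```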
Treatment can take any value: \(D_i \in \mathcal{D} \subseteq \mathbb{R}\)
Potential outcomes for each treatment level: \(Y_i(d)\) for every \(d \in \mathcal{D}\).
Observed outcome: \[Y_i = Y_i(D_i) \quad \text{where } D_i \text{ is the realized treatment}\]
We observe exactly one point on an infinite dose-response curve
For each unit \(i\), the map \(d \mapsto Y_i(d)\) traces out a dose-response curve.
Causal questions:
Key insight: With continuous treatment, we observe 1 point and miss infinitely many. The missing data problem is infinitely worse!
Very little hope for learning about unit-specific treatment effects
We will acknowledge that learning unit-specific TEs is hard, if not impossible.
We will focus on treatment effects in an average sense, but allow them to vary with \(X\).
For simplicity, I will focus on binary treatment setups.
ATT: The Average Treatment Effect among the Treated units is \[ATT = \mathbb{E}\left[Y_{i}(1) - Y_{i}(0) | D_i = 1\right]\]
What is the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation, among workers who are actually eligible for it? Particularly useful to assess whether workers who are eligible for 401(k) benefit from it (accumulate more assets).
ATU: The Average Treatment Effect among Untreated units is \[ATU = \mathbb{E}\left[Y_{i}(1) - Y_{i}(0) | D_i = 0\right]\]
What is the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation, among workers who were not eligible for it? Particularly useful to assess whether a 401(k) plan would benefit those who were not eligible for it.
ATE: The (overall) Average Treatment Effect is \[ATE = \mathbb{E}\left[Y_{i}(1) - Y_{i}(0)\right]\]
What is the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation among all workers? Particularly useful to assess the value of 401(k) plans if they were available in all firms.
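The three averages are easy to contrast in an “oracle” simulation where both potential outcomes are known by construction (never the case in real data); the selection-on-gains mechanism below is purely illustrative:

```python
# Oracle sketch contrasting ATE, ATT, and ATU when take-up is related to gains.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
gain = rng.normal(10_000, 5_000, n)           # unit-level effect Y_i(1) - Y_i(0)
y0 = rng.normal(30_000, 10_000, n)
y1 = y0 + gain
# Assumed selection on gains: higher gains -> more likely to be treated
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-gain / 5_000))).astype(int)

ate = np.mean(y1 - y0)
att = np.mean((y1 - y0)[d == 1])
atu = np.mean((y1 - y0)[d == 0])
print(f"ATE = {ate:,.0f}, ATT = {att:,.0f}, ATU = {atu:,.0f}")
# With selection on gains, ATT > ATE > ATU: the three parameters answer different questions.
```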
What if we want to express average causal effects as relative lifts?
All the average causal parameters discussed so far are expressed in the same units as \(Y\).
If \(Y\) is expressed in dollars, \(ATE\), \(ATT\) and \(ATU\) will also be expressed in dollars.
If \(Y\) is expressed in number of units shipped, \(ATE\), \(ATT\) and \(ATU\) will also be expressed in number of units shipped.
Sometimes, we want to translate the \(ATE\), \(ATT\) or \(ATU\) into percentage terms.
RATT: The Relative Average Treatment Effect among the Treated units is \[RATT = \dfrac{\mathbb{E}\left[Y_{i}(1) - Y_{i}(0) | D_i = 1\right]}{\mathbb{E}\left[Y_{i}(0) | D_i = 1\right]}\]
RATU: The Relative Average Treatment Effect among Untreated units is \[RATU = \dfrac{\mathbb{E}\left[Y_{i}(1) - Y_{i}(0) | D_i = 0\right]}{\mathbb{E}\left[Y_{i}(0) | D_i = 0\right]}\]
RATE: The Relative Average Treatment Effect is \[RATE = \dfrac{\mathbb{E}\left[Y_{i}(1) - Y_{i}(0)\right]}{\mathbb{E}\left[Y_{i}(0)\right]}\]
What if we want to understand treatment effects in a distributional sense?
Despite their popularity, average treatment effects can mask important treatment effect heterogeneity across different subpopulations, see, e.g., Bitler, Gelbach, and Hoynes (2006).
Let’s say that the \(ATE\) of being eligible for 401(k) on asset accumulation is $10,000. This average alone does not tell us who benefits, or how much effects vary across workers.
We can focus on different treatment effect parameters beyond the mean to better uncover treatment effect heterogeneity. Leading examples include the distributional and quantile treatment effect parameters.
Let \(F_{Y(d)}(y) = \mathbb{P}(Y(d) \le y)\) denote the marginal distribution of the potential outcome \(Y(d)\).
Let \(F_{Y(d)|D=a}(y) = \mathbb{P}(Y(d) \le y|D=a)\) denote the conditional distribution of the potential outcome \(Y(d)\) among units with treatment \(a\).
DTT(y): The Distributional Treatment Effect among the Treated units is \[DTT(y) = F_{Y(1)|D=1}(y) - F_{Y(0)|D=1}(y).\]
DTU(y): The Distributional Treatment Effect among the Untreated units is \[DTU(y) = F_{Y(1)|D=0}(y) - F_{Y(0)|D=0}(y).\]
DTE(y): The Distributional Treatment Effect is \[DTE(y) = F_{Y(1)}(y) - F_{Y(0)}(y).\]
See, e.g., Firpo (2007), Chen, Hong, and Tarozzi (2008), Firpo and Pinto (2016) and Belloni et al. (2017) for discussions.
Example: 401(k) eligibility effect on assets (illustrative numbers)
Computation at \(y = \$10,000\): \[DTE(\$10k) = F_{Y(1)}(\$10k) - F_{Y(0)}(\$10k) = 0.12 - 0.32 = -0.20\]
Interpretation: 401(k) eligibility decreases the fraction with assets \(\le \$10k\) by 20 percentage points.
All these distributional treatment effect parameters are functional parameters as they vary with the evaluation point \(y\in \mathbb{R}\).
They are all expressed in probability units (percentage points), as each is the difference of two distribution functions.
These parameters are bounded between \(-1\) and \(1\). As functions of \(y\), they all start and end at zero: at \(y=-\infty\) both CDFs equal 0, and at \(y=\infty\) both equal 1, so the difference vanishes at both extremes.
This is equivalent to binarizing the potential outcome at a given threshold, \(\widetilde{Y}_y(d) =1\{Y(d)\leq y\}\), and computing the ATE/ATT/ATU using the binarized outcome.
The appeal is that you can do this for many thresholds \(y\), not just a single fixed one.
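A sketch of this binarization idea on simulated asset data (lognormal draws, purely illustrative), evaluating \(DTE(y)\) at a few thresholds:

```python
# Sketch: DTE(y) is the ATE of the binarized outcome 1{Y(d) <= y}, here computed
# on an oracle simulation over several thresholds (all numbers are hypothetical).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
y0 = rng.lognormal(mean=10.0, sigma=0.8, size=n)      # hypothetical assets if untreated
y1 = y0 * rng.lognormal(mean=0.2, sigma=0.3, size=n)  # hypothetical assets if treated

for y in (1_000, 10_000, 50_000):
    dte = np.mean(y1 <= y) - np.mean(y0 <= y)          # F_{Y(1)}(y) - F_{Y(0)}(y)
    print(f"DTE({y:>6,}) = {dte:+.3f}")
# All values are negative here: eligibility shrinks the fraction with assets at or below y.
```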
You need to pay attention to the sign of the parameter:
401(k) example: \(DTE(1,000)\) would measure the difference in the fraction of workers with at most $1,000 in accumulated assets in treated and untreated states.
If this number is positive, it means that the fraction of workers accumulating at most $1,000 in assets is higher when they are eligible for 401(k) than if they were not eligible.
This implies that the fraction of workers accumulating more than $1,000 in assets is smaller when they are eligible for 401(k) than if they were not eligible.
For \(\tau\in\left(0,1\right)\), let \(q_{Y(d)}(\tau) = \inf\left\{y:F_{Y(d)}\left(y\right) \geq\tau\right\}\) denote the quantile function of the potential outcome \(Y(d)\).
We define \(q_{Y(d)|D=1}(\tau)\) and \(q_{Y(d)|D=0}(\tau)\) analogously.
These are quantile functions and are always expressed in the same unit of measure as the potential outcome \(Y(d)\).
QTT(\(\tau\)): The Quantile Treatment Effect among the Treated units is \[QTT(\tau) = q_{Y(1)|D=1}(\tau) - q_{Y(0)|D=1}(\tau).\]
QTU(\(\tau\)): The Quantile Treatment Effect among the Untreated units is \[QTU(\tau) = q_{Y(1)|D=0}(\tau) - q_{Y(0)|D=0}(\tau).\]
QTE(\(\tau\)): The Quantile Treatment Effect is \[QTE(\tau) = q_{Y(1)}(\tau) - q_{Y(0)}(\tau)\]
E.g., if \(QTE(\tau) \approx \$0\) for \(\tau \in (0,0.3)\), \(QTE(\tau) \approx \$5,000\) for \(\tau \in (0.4,0.6)\), and \(QTE(\tau) > \$15,000\) for \(\tau > 0.7\), this would mean that 401(k) eligibility benefits mostly those in the upper-tail of the wealth distribution.
Key insight: QTE fixes a probability (quantile \(\tau\)) and compares asset levels
Example: 401(k) eligibility effect on assets (same distributions as before)
Computation at \(\tau = 0.5\) (median): \[QTE(0.5) = q_{Y(1)}(0.5) - q_{Y(0)}(0.5) = \$35k - \$24k = \$11k\]
Interpretation: Eligibility increases median assets by $11,000.
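A sketch computing \(QTE(\tau)\) as a difference of marginal quantiles, again on simulated and purely illustrative asset data:

```python
# Sketch: QTE(tau) = q_{Y(1)}(tau) - q_{Y(0)}(tau), on an oracle simulation.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
y0 = rng.lognormal(10.0, 0.8, n)            # hypothetical untreated assets
y1 = y0 * rng.lognormal(0.2, 0.3, n)        # hypothetical treated assets

for tau in (0.25, 0.50, 0.75):
    qte = np.quantile(y1, tau) - np.quantile(y0, tau)
    print(f"QTE({tau:.2f}) = {qte:,.0f}")
# Dollar effects are larger at higher quantiles here because the assumed shift is
# multiplicative, even though the relative lift is similar across quantiles.
```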
The quantile and distributional treatment parameters discussed up to now are expressed as the difference of two quantile and distribution functions, respectively.
In general, these should not be interpreted as the distribution of the treatment effects, or the quantile of the treatment effects; see, e.g., Heckman, Smith, and Clements (1997), Masten and Poirier (2020), and Callaway (2021).
They can sometimes coincide, but that requires additional assumptions, such as rank-invariance; see, e.g., Heckman, Smith, and Clements (1997) for a discussion.
DoTT(y): The Distribution of Treatment Effect among the Treated units is \[DoTT(y) = F_{Y(1)-Y(0)|D=1}(y).\]
DoTE(y): The Distribution of Treatment Effect is \[DoTE(y) = F_{Y(1)-Y(0)}(y).\]
QoTT(\(\tau\)): The Quantile of Treatment Effect among the Treated units is \[QoTT(\tau) = q_{Y(1)-Y(0)|D=1}(\tau).\]
QoTE(\(\tau\)): The Quantile of Treatment Effect is \[QoTE(\tau) = q_{Y(1)-Y(0)}(\tau).\]
Example: Distribution of treatment effects for 401(k) eligibility
Computation at \(y = \$10,000\): \[DoTE(\$10k) = F_{Y(1)-Y(0)}(\$10k) = 0.50\]
Interpretation: 50% of the population has treatment effects \(\le \$10,000\).
Key insight: QoTE fixes a probability (quantile \(\tau\)) and finds the corresponding treatment effect value
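The sketch below uses an oracle simulation with a deliberately “compressing” (hypothetical) effect to illustrate why the quantile of unit-level effects (QoTE) need not equal the difference of marginal quantiles (QTE):

```python
# Oracle sketch: difference of quantiles (QTE) vs. quantile of differences (QoTE).
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
y0 = rng.normal(30_000, 10_000, n)
y1 = 25_000 + 0.5 * y0                      # assumed "compressing" effect: poorer units gain more
effect = y1 - y0                            # unit-level effects, known only in simulations

tau = 0.75
qte = np.quantile(y1, tau) - np.quantile(y0, tau)    # difference of marginal quantiles
qote = np.quantile(effect, tau)                      # quantile of unit-level effects
dote_10k = np.mean(effect <= 10_000)                 # DoTE($10k)
print(f"QTE(0.75) = {qte:,.0f}  vs  QoTE(0.75) = {qote:,.0f},  DoTE($10k) = {dote_10k:.2f}")
# QTE(0.75) is about 6,600 while QoTE(0.75) is about 13,400: a difference of
# quantiles is not the quantile of differences, even though ranks are preserved here.
```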
What if we want to understand how the average causal effects vary with covariates?
Let \(X_{all}\) be a set of features/covariates available to you, and let \(X_{s}\) be a subset of \(X_{all}\).
CATE: The Conditional Average Treatment Effect given \(X_{s}\) is \[CATE_{X_{s}}(x_{s}) = \mathbb{E}\left[Y_{i}(1) - Y_{i}(0) | X_{s} = x_{s}\right]\]
How does the average treatment effect of being eligible for 401(k) retirement plans on asset accumulation vary with age, marital status, and number of kids? What about education?
Note that CATE\((x_s)\) is a functional parameter, as it varies with the covariate values \(x_s\).
Other parameters have similar characterizations.
Example: How does 401(k) eligibility effect vary with age?
Computations at different ages:
Interpretation: Older workers benefit more from 401(k) eligibility, possibly due to higher earnings and greater ability to contribute.
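A sketch of how one could tabulate \(CATE(x)\) by age bins in an oracle simulation where unit-level effects are assumed to grow with age (all numbers hypothetical):

```python
# Oracle sketch: CATE(x) as the average of simulated unit-level effects within age bins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 100_000
age = rng.integers(25, 65, n)
effect = 200 * (age - 25) + rng.normal(0, 2_000, n)   # assumed: effects grow with age
y0 = rng.normal(30_000, 10_000, n)
y1 = y0 + effect

df = pd.DataFrame({"age_group": pd.cut(age, [24, 34, 44, 54, 64]), "te": y1 - y0})
print(df.groupby("age_group", observed=True)["te"].mean().round(0))  # CATE by age bin
```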
There Is No “The” Causal Effect!
Only different averages, distributions, and quantiles
Unit-specific effects \(Y_i(1) - Y_i(0)\) vary across units
When we report “the ATT” or “the ATE,” we’re reporting one particular average
Different parameters answer different questions:
Key Insight: The “right” parameters are ones that are meaningful for your research question. More complex parameters can reveal richer heterogeneity, but may also be harder to learn.
All these are very well motivated in cross-sectional setups.
But what if we have panel data?
Do we need to adapt these different parameters?
Consider these counterfactual questions:
Takeaway: We need potential outcomes indexed by treatment histories, not just current treatment status
Setting:
Example: Card and Krueger (1994)
Only two treatment sequences matter:
With a single treatment time, we can simplify:
This maps to our cross-sectional notation:
Key insight: All parameters from earlier apply, but now indexed by time: \(ATT(t) = \mathbb{E}[Y_{it}(g) - Y_{it}(\infty) \mid G_i = g]\).
Can study dynamic effects: How does \(ATT(t)\) change as \(t\) increases?
Remember our cross-sectional parameters? They all extend to panel data!
ATE\((t)\) \(= \mathbb{E}[Y_t(g) - Y_t(\infty)]\) - Effect if everyone were treated at \(g\) vs. never
QTT\((t,\tau)\) \(= q_{Y_t(g)|G=g}(\tau) - q_{Y_t(\infty)|G=g}(\tau)\) - Effect at \(\tau\)-th quantile for the treated at time \(t\)
DTT\((t,y)\) \(= F_{Y_t(g)|G=g}(y) - F_{Y_t(\infty)|G=g}(y)\) - Distributional effects: How does the CDF shift?
DoTT\((t,y)\) \(= F_{Y_t(g)-Y_t(\infty)|G=g}(y)\) - Full distribution of unit-level effects
CATT\((t,x)\) \(= \mathbb{E}[Y_t(g) - Y_t(\infty) | G=g, X=x]\) - Heterogeneous effects by pre-treatment covariates
Key insight: Panel data enriches all parameters with time dimension \(t\). Each requires additional assumptions and estimation strategies.
With multiple post-treatment periods, we have many \(ATT(t)\)’s:
| | \(t=1\) | \(t=2\) | \(t=3\) | \(t=4\) | \(t=5\) |
|---|---|---|---|---|---|
| \(g=3\) | — | — | \(ATT(3)\) | \(ATT(4)\) | \(ATT(5)\) |
Summary measures:
Setting:
Some Empirical Examples:
Treatment sequence determined by \(G_i\):
Since treatment stays on, potential outcomes can be indexed by the first treatment time: \(Y_{it}(g)\), with \(Y_{it}(\infty)\) for the never-treated.
Group-time ATT (Callaway and Sant’Anna 2021): \[ATT(g, t) = \mathbb{E}[Y_t(g) - Y_t(\infty) \mid G = g]\]
Building blocks: The \(ATT(g,t)\)’s are fundamental parameters. We can aggregate them in different ways to answer different research questions.
Example: \(T = 5\) periods, groups \(g \in \{2, 3, 4, 5, \infty\}\)
| | \(t=2\) | \(t=3\) | \(t=4\) | \(t=5\) |
|---|---|---|---|---|
| \(g=2\) | \(ATT(2,2)\) | \(ATT(2,3)\) | \(ATT(2,4)\) | \(ATT(2,5)\) |
| \(g=3\) | — | \(ATT(3,3)\) | \(ATT(3,4)\) | \(ATT(3,5)\) |
| \(g=4\) | — | — | \(ATT(4,4)\) | \(ATT(4,5)\) |
| \(g=5\) | — | — | — | \(ATT(5,5)\) |
10 parameters! And this is just 5 periods…
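To see how the building blocks line up, here is an oracle sketch that simulates a staggered panel with known potential outcomes (the assumption that effects grow by one unit per period of exposure is arbitrary) and tabulates the ten \(ATT(g,t)\)’s:

```python
# Oracle sketch: with Y_t(g) and Y_t(infinity) both known by construction,
# ATT(g,t) is just the group-g average of Y_t(g) - Y_t(infinity) for t >= g.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n, T = 5_000, 5
groups = rng.choice([2, 3, 4, 5, np.inf], size=n)        # first treatment period (inf = never)

rows = []
for t in range(1, T + 1):
    y_inf = rng.normal(10, 2, n) + 0.5 * t               # untreated potential outcome
    exposure = np.where(groups <= t, t - groups + 1, 0)  # periods of exposure by time t
    y_g = y_inf + 1.0 * exposure                         # assumed: effect grows with exposure
    rows.append(pd.DataFrame({"g": groups, "t": t, "y_g": y_g, "y_inf": y_inf}))
panel = pd.concat(rows)

att = (panel.query("g <= t")                             # post-treatment (g,t) cells only
            .assign(te=lambda d: d.y_g - d.y_inf)
            .groupby(["g", "t"])["te"].mean())
print(att.round(2))                                      # the 10 ATT(g,t) building blocks
```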
Question: Which aggregation should you use? It depends on your research question!
We have many \(ATT(g,t)\) parameters. How do we aggregate them?
Four main approaches, each answering a different question:
Cohort Heterogeneity \(\theta_S(g)\) “How does the average effect differ for early vs. late adopters?”
Calendar Time \(\theta_C(t)\) “What is the overall policy average effect at time \(t\)?”
Event-Study \(\theta_D(e)\) “How do average effects evolve with exposure (e.g., 1 year post, 2 years post)?”
Overall Average \(\theta^O\) “What is a single summary number for the entire policy?”
Key insight: All four are valid! Your choice depends on your research question.
General form: \(\displaystyle\sum_{g=2}^{T} \sum_{t=2}^{T} \mathbf{1}\{g \leq t\} \, w_{g,t} \, ATT(g,t)\)
Simple aggregations (assuming no anticipation):
Unweighted average: \[\theta_M^O := \frac{2}{T(T-1)} \sum_{g,t: g \leq t} ATT(g,t)\]
Weighted by group size: \[\theta_W^O := \frac{1}{\kappa} \sum_{g,t: g \leq t} ATT(g,t) \cdot P(G = g | G \neq \infty)\]
Problem: These “overweight” units that have been treated longer (early adopters contribute more)!
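A sketch of both simple aggregations, using illustrative \(ATT(g,t) = t - g + 1\) values (matching the oracle sketch above) and assumed equal group shares:

```python
# Sketch of the two "simple" overall aggregations with illustrative ATT(g,t) values.
T = 5
att = {(g, t): t - g + 1 for g in range(2, T + 1) for t in range(g, T + 1)}
p_g = {g: 0.25 for g in range(2, T + 1)}              # assumed P(G = g | ever treated)

theta_M = sum(att.values()) * 2 / (T * (T - 1))       # unweighted average over (g,t) cells
kappa = sum(p_g[g] for (g, t) in att)                 # normalizing constant
theta_W = sum(att[(g, t)] * p_g[g] for (g, t) in att) / kappa
print(f"theta_M^O = {theta_M:.2f}, theta_W^O = {theta_W:.2f}")
# Early adopters (g = 2) contribute 4 of the 10 (g,t) cells, so they carry the most weight.
```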
Aggregation Strategy 1:
Cohort Heterogeneity
Focus: Do early vs. late adopters experience different average effects?
Average effect for units in group \(g\):
\[\theta_S(g) = \frac{1}{T - g + 1} \sum_{t=g}^{T} ATT(g,t)\]
Question: “Do early Medicaid expanders benefit more than late expanders?”
Aggregation Strategy 2:
Calendar Time Heterogeneity
Focus: What is the overall average effect at a specific calendar time?
Average effect at time \(t\) for all treated:
\[\theta_C(t) = \sum_{g \leq t} ATT(g,t) \cdot P(G = g | G \leq t)\]
Question: “What was the overall policy average impact among treated in 2020?”
Aggregation Strategy 3:
Event-Study / Dynamic Effects
Focus: How do average effects evolve with time since treatment?
Average effect at exposure \(e = t - g\):
\[\theta_D(e) = \sum_{g: g+e \leq T} ATT(g, g+e) \cdot P(G = g | G+e \leq T)\]
Question: “How does the effect evolve since adoption?” (Wolfers 2006)
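The three aggregation strategies can be sketched in a few lines, again with illustrative \(ATT(g,t) = t - g + 1\) values and assumed equal group shares:

```python
# Sketch of the cohort, calendar-time, and event-study aggregations of ATT(g,t).
T = 5
att = {(g, t): t - g + 1 for g in range(2, T + 1) for t in range(g, T + 1)}
p_g = {g: 0.25 for g in range(2, T + 1)}              # assumed P(G = g | ever treated)

# Cohort heterogeneity: average post-treatment effect for group g
theta_S = {g: sum(att[(g, t)] for t in range(g, T + 1)) / (T - g + 1)
           for g in range(2, T + 1)}

# Calendar time: average effect at time t across groups already treated by t
theta_C = {t: sum(att[(g, t)] * p_g[g] for g in range(2, t + 1)) /
              sum(p_g[g] for g in range(2, t + 1))
           for t in range(2, T + 1)}

# Event study: average effect at exposure e = t - g across feasible groups
theta_D = {e: sum(att[(g, g + e)] * p_g[g] for g in range(2, T - e + 1)) /
              sum(p_g[g] for g in range(2, T - e + 1))
           for e in range(0, T - 1)}

print("theta_S:", theta_S)
print("theta_C:", theta_C)
print("theta_D:", theta_D)
```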
Event-study aggregations \(\theta_D(e)\) are plotted against event time \(e = t - g\):
Typical features:
Questions: Why are these pre-treatment parameters zero? Do they need to be zero?
Aggregation Strategy 4:
Overall Summary (Scalar)
Focus: What is one single number summarizing the entire policy effect?
Sometimes we want a single number! Further aggregate the intermediate parameters:
Key insight: \(\theta^O_S\), \(\theta^O_C\), and \(\theta^O_D\) need not be equal! But we know how they relate via \(ATT(g,t)\).
Which to report? Depends on your research question:
Key Insight: All four are valid! They answer different questions. Choose based on whether you care about: (1) which groups, (2) which time periods, (3) dynamics, or (4) overall summary.
Question for you:
Is \(ATT(g,t) = \mathbb{E}[Y_t(g) - Y_t(\infty) | G = g]\) the only causal parameter we can define?
Obviously not! Remember our cross-sectional parameters? They all extend to \((g,t)\):
The \(ATT(g,t)\) framework is a template — swap in your parameter of interest!
\(ATT(g,t)\) always conditions on \(G = g\). But we can target other populations:
ATU\((g,t)\) — Effect for the never-treated: \[ATU(g,t) = \mathbb{E}[Y_t(g) - Y_t(\infty) \mid G = \infty]\]
ATE\((g,t)\) — Effect for everyone: \[ATE(g,t) = \mathbb{E}[Y_t(g) - Y_t(\infty)]\]
Key point: All compare treatment at \(g\) vs. never-treated (\(\infty\)). But who are we averaging over? Treated (\(G=g\)), never-treated (\(G=\infty\)), or everyone?
Why always compare to never-treated? We can compare different treatment timings:
Example: What’s the effect of adopting Medicaid in 2014 vs. 2016?
Takeaway: The potential outcomes framework is flexible! Define the comparison that answers your research question.
A hidden assumption in everything we’ve done:
Once treated, always treated.
(Treatment is absorbing)
Staggered adoption allows:
Staggered adoption forbids:
Many real-world treatments are not absorbing:
Democracy — Countries democratize and experience reversals (Acemoglu et al. 2019)
Policy adoption — States adopt minimum wages, then repeal them
Program participation — Workers enter and exit job training
Medical treatments — Patients start and stop medications
The question: How do we define potential outcomes and causal parameters when treatment sequences can be any pattern of 0s and 1s?
Single treatment time and Staggered are special cases of a more general setup:
Let \(D_{it} \in \{0,1\}\) be treatment status for unit \(i\) at time \(t\)
Treatment sequence: \(\mathbf{d} = (d_1, d_2, \ldots, d_T)\) where each \(d_t \in \{0,1\}\)
Examples with \(T=4\):
Single treatment time: Only \((0,\ldots,0,1,\ldots,1)\) at fixed \(g\) or \((0,\ldots,0)\)
Staggered: Only \((0,\ldots,0,1,\ldots,1)\) at varying \(g\) or \((0,\ldots,0)\)
General: Any sequence \(\mathbf{d} \in \{0,1\}^T\)
General notation (encompasses Single treatment time and Staggered):
How our earlier notation fits:
Key insight: The Robins (1986) notation handles ALL cases; single treatment time and staggered adoption are special cases where sequences have a simple structure.
Building Block causal parameter: \(ATT(\mathbf{d}, t) = \mathbb{E}[Y_t(\mathbf{d}) - Y_t(\infty) \mid G = \mathbf{d}]\)
Connection to what we learned:
With \(T\) periods: up to \(2^{T-1}\) potential treatment sequences! Not all need to exist, i.e., some \(\mathbf{d}\) may have zero probability.
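A small sketch enumerating all binary sequences with \(T = 4\) (no one treated in \(t = 1\)), counting how many have the staggered shape, and showing that grouping by first treatment time \(G^{start}\) lumps together very different histories:

```python
# Sketch: enumerate treatment sequences for T = 4 and classify them.
from itertools import product

T = 4
sequences = [(0,) + d for d in product((0, 1), repeat=T - 1)]   # 2^(T-1) = 8 paths

def is_staggered(d):
    # staggered adoption: once treatment turns on, it never turns off
    return all(not (d[t] == 1 and d[t + 1] == 0) for t in range(len(d) - 1))

staggered = [d for d in sequences if is_staggered(d)]
print(len(sequences), "possible sequences;", len(staggered), "are staggered:", staggered)

# Grouping by first treatment time mixes very different histories:
first_on = {d: next((t + 1 for t, x in enumerate(d) if x == 1), None) for d in sequences}
print({d: g for d, g in first_on.items() if g == 2})   # same G^start = 2, four different paths
```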
When treatment can turn on AND off:
Example: Democracy & Growth (Acemoglu et al. 2019)
Same \(G^{start}\), very different histories! Grouping by “when first treated” lumps together stable democracies, brief episodes, and on-off patterns. (Illustrative example)
With \(T\) periods (no one treated in \(t=1\)): \(2^{T-1}\) possible treatment sequences
Example with \(T=4\):
Staggered adoption:
Treatment on/off:
The practical challenge:
Common approach: Aggregate by “time since first treated” (\(e = t - g^{start}\))
Define \(\widetilde{ATT}(g^{start}, t)\) = weighted average of \(ATT(\mathbf{d}, t)\) across all sequences \(\mathbf{d}\) that start treatment at \(g^{start}\).
Democracy example: What does \(\widetilde{ATT}(g^{start}=2, t=4)\) aggregate?
The interpretation problem: \(\widetilde{ATT}(g^{start}, t)\) mixes units with very different treatment histories! At \(t=4\), some countries had 3 years of democracy, others had 1, others had 2 with a gap.
This is hard to rationalize if democratization affects GDP not only contemporaneously but also via duration and persistence, i.e., in the presence of carryover effects.
Several papers discuss “staggerizing” treatment using \(G^{start}\) (Sun and Abraham 2021; Chaisemartin and D’Haultfœuille 2024): \[\widetilde{\theta}_D(e) = \sum_{g^{start}} w(g^{start}; e) \cdot \widetilde{ATT}(g^{start}, g^{start} + e)\] where \(w(g^{start}; e) = P(G^{start} = g^{start} \mid G^{start} + e \leq T)\) is the cohort share among units first-treated at least \(e\) periods ago.
Unfortunately, \(\widetilde{\theta}_D(e)\) does NOT represent an average effect w.r.t. length of exposure.
Why not?
Democracy: “2 years since first democratized” includes stable democracies AND countries that reverted.
Staggered adoption: ✓ clear interpretation, \(e=2\) means 2 periods of treatment.
Treatment on/off: ✗ unclear interpretation, \(e=2\) mixes very different treatment exposures.
With treatment on/off, event-study coefficients aggregate incomparable treatment histories.
Several papers discuss potential ways to move forward in this setting (Chaisemartin and D’Haultfœuille 2020, 2024; Liu, Wang, and Xu 2024)
These solutions often involve restricting treatment effect dynamics by imposing limited/no-carryover conditions:
These assumptions rule out long-run treatment effect dynamics
Key insight: The Robins (1986) framework handles treatment on/off, but aggregation and interpretation become much harder without restricting how past treatments affect current outcomes.
Everything discussed generalizes to multi-valued and continuous treatments
In staggered adoption, potential outcomes are now indexed by \((g, d)\): \(Y_{it}(g, d)\) is the outcome from adopting dose \(d\) at time \(g\).
The potential outcomes framework extends naturally:
Same principles for causal parameters and aggregations apply
See Callaway, Goodman-Bacon, and Sant’Anna (2024) for details
Setting: States adopt different minimum wage levels at different times
Causal parameters: \(ATT(g, d, t)\) — effect of adopting wage \(d\) at time \(g\), measured at \(t\)
Potential outcomes provide a unified framework for defining causal effects in panel data
Treatment timing matters: Single treatment time \(\to\) Staggered adoption \(\to\) Treatment on/off
Building blocks: \(ATT(g,t)\) parameters capture group-time specific effects
Aggregation: Cohort \(\theta_S(g)\), calendar time \(\theta_C(t)\), and event-study \(\theta_D(e)\) answer different questions
Flexibility vs. tractability: More general treatment patterns are hard to learn and aggregate absent additional assumptions (e.g., limited/no-carryover)
The framework is a template: Extends to QTT, DTT, CATT, ATU, ATE, and continuous treatments
Start with your empirical setting, then choose the appropriate framework!
Choose an empirical panel data paper and analyze its causal framework:
What is the treatment? Is it binary, multi-valued, or continuous?
What treatment pattern applies? Single treatment time, staggered adoption, or on/off?
What are the potential outcomes? Write them out explicitly.
What causal parameter is the paper targeting? \(ATT(g,t)\)? An aggregation?
Could the paper benefit from alternative parameters (e.g., event-study, heterogeneity by covariates)?
This exercise builds intuition for connecting empirical questions to the potential outcomes framework.
Questions?
Next: Randomizing Treatment Sequences

ECON 730 | Causal Panel Data | Pedro H. C. Sant’Anna