Controlling time-varying confounding in difference-in-differences studies using the time-varying treatments framework

2.1 Treatment strategies and potential outcome notation

The TVT framework is concerned with estimating the joint effect of a series of treatments on a final outcome. Let random variable \(D_t\) denote treatment at time t, and let random vector \(\overline{D}_T = (D_1, \ldots , D_T)\) denote a series of treatments through final time period T. This series of treatments is also called a treatment history or treatment strategy. Let \(d_t\) and \(\overline{d}_T\) be specific realizations of \(D_t\) and \(\overline{D}_T\), respectively. When referring to the entire treatment history through the final time point T, the T subscript can be dropped. (That is, \(\overline{D}\) and \(\overline{d}\) refer to the full treatment history/strategy.) For example, \(\overline{d} = (0, \ldots , 0)\) is the “never-treat” strategy, and \(\overline{d} = (1, \ldots , 1)\) is the “always-treat” strategy. These two strategies are examples of static treatment strategies: the series of treatments is a fixed sequence that does not depend on covariates. It is also possible to specify dynamic treatment strategies in which \(D_t\) is a function of (possibly time-varying) covariates. However, this article will focus on static strategies, as static strategies have direct parallels with commonly used DiD estimands. (I will comment on the potential for studying dynamic strategies in policy contexts in the Discussion.)

Let \(Y_t(\overline{d})\) denote the potential outcome at time t under treatment strategy \(\overline{d}\). The overbar notation is also used to denote covariate histories. For a set of covariates \(\mathbf{X}_t = (X_{1t}, \ldots , X_{pt})\) measured at time t, let \(\overline{\mathbf{X}}_t\) denote the covariate history at time t, consisting of the sequence of covariate values for each covariate from time 1 to time t.

2.2 Causal estimands

In the TVT framework, causal effects are defined as contrasts of potential outcomes under arbitrary treatment strategies \(\overline{d}\) and \(\overline{d}^*\). While the average treatment effect across all units (ATE) \(E[Y_t(\overline{d}) - Y_t(\overline{d}^*)]\) is most common in the TVT literature, average treatment effects on the treated (ATTs) may also be of interest.

Typical DiD estimands can be framed in terms of treatment strategy contrasts. In the canonical DiD setup where there are two groups of units (treated and control) measured over two time periods (\(t=1,2\)), two particular static treatment strategies are compared: \(\overline{d} = (0, 0)\) (control group) and \(\overline{d}^* = (0, 1)\) (treatment group). Generally, the causal estimand of interest is the average effect of treatment on the treated (ATT) in time period 2: \(E[Y_2(0,1) - Y_2(0,0) \mid \overline{D} = (0,1)]\).

A common variation on the canonical DiD setup is the staggered rollout (multiple time period) design. In this design, there are multiple time periods in which units become and remain treated (reflecting the common viewpoint of treatment as an absorbing state). As in the canonical setting, ATT estimands are typically of interest and can be expressed in terms of treatment strategy contrasts. Here, treatment strategies can be indexed by the time period g in which units are first treated (referred to as a treatment “group”). Letting T be the total number of time periods, the static treatment strategies of interest are \(\overline{d}^* = (\mathbf{0}_{g-1}, \mathbf{1}_{T-g+1})\) for \(g = 1, \ldots , T\), where \(\mathbf{0}_a\) and \(\mathbf{1}_a\) are vectors of zeroes and ones repeated a times. The “never-treated” strategy \(\overline{d} = \mathbf{0}_T\) typically forms the reference strategy in causal contrasts. Let G denote the time at which a unit was first treated. The recent framework of Callaway and Sant’Anna (2021) (henceforth referred to as the “CS2021 framework”) centers the group-time ATT as the causal estimand of interest:

$$\begin{aligned} ATT(g,t) = E[Y_t(\overline{d}^*) - Y_t(\mathbf{0}_T) \mid G = g], \qquad g = 2,\ldots ,T. \end{aligned}$$

(1)

The group-time ATT represents the causal effect for a given treatment timing group g in a given time period t. In the absence of anticipatory effects of treatment (i.e., potential outcomes being affected by future treatment), interest generally lies in estimating these group-time effects for \(t \ge g\). (Researchers may also examine estimates at times \(t < g\), where effects should be absent, to assess the plausibility of the parallel trends assumption.)
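As a concrete illustration (the specific values of T, g, and t below are hypothetical and chosen only for exposition), suppose \(T = 4\). The group first treated in period \(g = 3\) follows the static strategy \(\overline{d}^* = (0, 0, 1, 1)\), and its group-time ATT in period \(t = 4\) contrasts this strategy with the never-treated strategy:

$$\begin{aligned} ATT(3, 4) = E[Y_4(0,0,1,1) - Y_4(0,0,0,0) \mid G = 3]. \end{aligned}$$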

Note that the CS2021 framework does not estimate causal effects for units treated in the first time period. These units are dropped from the analysis because, under the parallel trends assumption, untreated potential outcomes are never observed for them: units treated in the first time period never have observed outcomes from time periods in which they were untreated. Under alternate (TVT framework) assumptions, causal effects for units treated in the first time period can be identified.

2.3 Identification assumptions

In order to identify causal effects, the TVT framework relies on an assumption called sequential (conditional) exchangeability (also called ignorability), which states that potential outcomes are conditionally independent of treatment at a particular time k given prior treatment history \(\overline{D}_{k-1}\) and covariate history \(\overline{\mathbf{X}}_k\):

$$\begin{aligned} Y_t(\overline{d}) \perp \!\!\!\perp D_k \mid \overline{D}_{k-1} = \overline{d}_{k-1}, \overline{\mathbf{X}}_k \qquad \text{for} \quad k = 1,\ldots ,T. \end{aligned}$$

(2)

We can think of sequential exchangeability as assuming a conditionally randomized experiment at each time point, where randomization depends on prior treatment and on covariates measured at prior and concurrent time points.
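To make this interpretation concrete, the following minimal Python sketch (with hypothetical variable names and functional forms, not taken from the simulation studies reported later) generates two periods of data in which treatment at each time point is randomized conditional only on measured prior treatment and covariates, so sequential exchangeability holds by construction; the time-varying covariate \(X_2\) is itself affected by earlier treatment.

```python
import numpy as np

# Minimal sketch of a data-generating process in which sequential
# exchangeability holds by construction (hypothetical functional forms).
rng = np.random.default_rng(0)
n = 5_000

# Time 1: baseline covariate X1 influences treatment D1 and the outcome.
X1 = rng.normal(size=n)
D1 = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * X1))))

# Time 2: time-varying covariate X2 is affected by past treatment D1 and
# influences both treatment D2 and the outcome (time-varying confounding).
# Treatment D2 depends only on the measured history (here D1 and X2).
X2 = 0.5 * X1 + 0.4 * D1 + rng.normal(size=n)
D2 = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.7 * D1 + 0.6 * X2))))

# Final outcome depends on the full treatment and covariate history.
Y2 = 1.0 + 0.5 * X1 + 0.8 * X2 + 1.0 * D1 + 2.0 * D2 + rng.normal(size=n)
```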

The identification of covariates needed for sequential exchangeability to hold has been aided by the use of causal graphs, which have a long history in biostatistics and epidemiology of clarifying the structural nature of biases in causal inference (Robins 2001). It is beyond the scope of this article to provide a detailed explanation of how to use causal graphs to identify suitable control variables, but I will provide a conceptual overview.

A causal graph depicts the (assumed) causal relationships between variables in the system under study. Examples are shown in Fig. 1. Strong background knowledge should guide construction of the graph (Ferguson et al. 2020; Rodrigues et al. 2022). Paths between the treatment and outcome variables have the potential to create statistical associations between them. These paths can be classified as either causal or noncausal.

Causal paths are directed paths (arrows flow in one direction) between the treatment and outcome and generate causal effects. Causal paths that would remain even after application of an intervention should be left intact.

Noncausal paths are undirected paths (arrows do not all flow in the same direction) between the treatment and outcome. These paths create spurious associations that need to be removed. Variables that suffice to “block” spurious association flow (i.e., to satisfy sequential exchangeability) can be identified from noncausal paths using a graphical technique called d-separation (Pearl 1995; Robins 2001) applied to a graphical object related to the causal graph: a single-world intervention graph (SWIG) (Richardson and Robins 2013). A SWIG is very similar to a causal graph but displays potential outcomes (instead of observed outcomes) and represents variables in both pre- and post-intervention forms simultaneously. Full details on the graphical procedure to identify variables needed to achieve sequential exchangeability are given in Chapter 19 of Hernán and Robins (2021). Once variables that suffice to achieve sequential exchangeability are identified from the SWIG, they are used in the estimation procedures described in the next section.

In contrast to sequential exchangeability, DiD methods rely chiefly on the parallel trends assumption, which states that the outcomes for the treated and control units would have evolved in parallel had the treated units remained untreated:

$$\begin{aligned} E[Y_2(0,0) - Y_1(0,0) \mid \overline{D} = (0,1)] = E[Y_2(0,0) - Y_1(0,0) \mid \overline{D} = (0,0)] \end{aligned}$$

(3)

The conditional parallel trends assumption states that the parallel trends assumption holds conditionally on covariates \(\mathbf{X}\) (generally assumed to be time-invariant, often pre-treatment, covariates in the DiD literature):

$$\begin{aligned} E[Y_2(0,0) - Y_1(0,0) \mid \overline{D} = (0,1), \mathbf{X}] = E[Y_2(0,0) - Y_1(0,0) \mid \overline{D} = (0,0), \mathbf{X}] \end{aligned}$$

(4)

When there are multiple time periods, the conditional parallel trends assumption can be generalized to hold across all time windows with two time points (Callaway and Sant’Anna 2021).

The key difference between parallel trends and sequential exchangeability lies in the role of covariates. The sequential exchangeability assumption requires that all confounders be measured at each time point. Of note, this requires measuring both time-invariant and time-varying confounders. In this sense, sequential exchangeability is a stronger assumption than parallel trends because assuming parallel trends does not require measurement of confounders that only affect the level of the outcome and whose composition remains constant between groups over time. The ability of parallel trends to handle time-invariant unmeasured confounding is a major strength of the DiD design. However, economic and cultural factors that influence policy outcomes likely vary in composition over time, have time-varying effects on the outcome, and can be affected by treatment. In these cases, conditional parallel trends can be violated, but sequential exchangeability would still allow for identification, provided these factors are measured.

2.4 Modeling and estimation

In the TVT framework, inverse probability weighting (IPW) can be used to estimate causal effects, similar to its use in the DiD literature (Abadie 2005; Callaway and Sant’Anna 2021; Sant’Anna and Zhao 2020; Stuart et al. 2014). Central to IPW approaches in both literatures is the estimation of the propensity score, the probability of receiving treatment conditional on an appropriate set of covariates. In the TVT framework, these covariates are those sufficient for achieving sequential exchangeability, and in the DiD framework, they are the covariates needed for conditional parallel trends to hold. The key difference between IPW approaches in the TVT and DiD frameworks is the form of the propensity score and/or the way in which weights are used (discussed further in Sect. 2.5).

2.4.1 ATE weights

In the TVT framework, stabilized IP weights for ATE estimands are given by

$$\begin{aligned} W^{ATE}(\overline{d},\overline{\mathbf{x}}) = \prod _{t=1}^T \frac{P(D_t = d_t \mid \overline{D}_{t-1}=\overline{d}_{t-1})}{P(D_t = d_t \mid \overline{D}_{t-1}=\overline{d}_{t-1}, \overline{\mathbf{X}}_t=\overline{\mathbf{x}}_t)} \end{aligned}$$

(5)

Essentially, the denominator of the weights is a unit’s probability of receiving their particular treatment strategy conditional on their treatment and covariate history. Stabilized IP weights have lower variance than their unstabilized counterparts, which use a numerator of 1 instead of \(P(D_t=d_t \mid \overline{D}_{t-1}=\overline{d}_{t-1})\).

Use of the ATE weights creates a pseudopopulation in which treatment indicators \(D_t\) are independent of confounders \(\overline{\mathbf{X}}\). As such, the mean potential outcome under any treatment strategy \(\overline{d}\) is identified by the average outcome among units following this strategy in the weighted population (pseudopopulation). For example, \(E[Y_t(0,1)]\) is identified by \(E_{\text{pseudo}}[Y_t \mid D_1=0, D_2=1]\), where the latter expectation is with respect to the pseudopopulation created by the ATE weights. Essentially, we apply the weights and compute the mean outcome among the (reweighted) units following the treatment strategy of interest.

In practice, this is operationalized with regression models. For example, fitting the outcome regression model \(E[Y_2 \mid d_1, d_2] = \beta _0 + \beta _1 d_1 + \beta _2 d_2\) using weighted least squares (with TVT ATE weights) provides estimates of the mean potential outcome at \(t=2\) under any length-2 treatment strategy (e.g., \(\beta _0\) represents the mean potential outcome for the (0, 0) strategy and \(\beta _0 + \beta _1+\beta _2\) for the (1, 1) strategy). Outcome regression models fit with IP-weighted least squares are also called marginal structural models (MSMs).
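As an illustration, the following Python sketch (a minimal example under the hypothetical two-period data-generating process from the sketch in Sect. 2.3, using statsmodels; a sketch rather than a definitive implementation) fits pooled logistic regressions for the numerator and denominator models, forms the stabilized product weights in (5), and fits the MSM by weighted least squares.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same hypothetical two-period data-generating process as the earlier sketch.
rng = np.random.default_rng(0)
n = 5_000
X1 = rng.normal(size=n)
D1 = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * X1))))
X2 = 0.5 * X1 + 0.4 * D1 + rng.normal(size=n)
D2 = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.7 * D1 + 0.6 * X2))))
Y2 = 1.0 + 0.5 * X1 + 0.8 * X2 + 1.0 * D1 + 2.0 * D2 + rng.normal(size=n)
df = pd.DataFrame({"X1": X1, "D1": D1, "X2": X2, "D2": D2, "Y2": Y2})

# Denominator models: P(D_t = 1 | treatment and covariate history).
den1 = smf.logit("D1 ~ X1", data=df).fit(disp=0).predict(df)
den2 = smf.logit("D2 ~ D1 + X1 + X2", data=df).fit(disp=0).predict(df)

# Numerator models: P(D_t = 1 | treatment history only), for stabilization.
num1 = np.full(n, df["D1"].mean())
num2 = smf.logit("D2 ~ D1", data=df).fit(disp=0).predict(df)

def p_obs(p_treat, d):
    """Probability of the treatment value actually received."""
    return np.where(d == 1, p_treat, 1 - p_treat)

# Stabilized IP weights: product of per-period numerator/denominator ratios.
sw = (p_obs(num1, df["D1"]) / p_obs(den1, df["D1"])) * \
     (p_obs(num2, df["D2"]) / p_obs(den2, df["D2"]))

# Marginal structural model fit by IP-weighted least squares.
msm = smf.wls("Y2 ~ D1 + D2", data=df, weights=sw).fit()
print(msm.params)  # e.g., the (1, 1)-strategy mean is beta_0 + beta_1 + beta_2
```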

2.4.2 ATT weights

Weights for ATT estimands (such as group-time ATTs) can be constructed in an analogous way to their construction in the single time point setting (treated units receiving a weight of 1 and control units receiving a weight equal to the conditional odds of treatment). However, in the multiple time period setting, we no longer have just one treated group; there are multiple groups of units who receive treatment (at different time points), and there are correspondingly different sets of ATT weights. Let \(\overline{d}^*\) represent the treatment strategy for the treatment group of current interest, and let \(\overline{d}\) represent the treatment strategy of the unit under consideration. The unnormalized ATT weights are given by:

$$\begin{aligned} W^{ATT}(\overline{d}, \overline{\mathbf{x}}) = {\left\{ \begin{array}{ll} 1, & \overline{d} = \overline{d}^* \\ \prod _{t=1}^T \frac{P(D_t=d^*_t \mid \overline{D}_{t-1}=\overline{d}^*_{t-1}, \overline{\mathbf{X}}_t=\overline{\mathbf{x}}_t)}{P(D_t=d_t \mid \overline{D}_{t-1}=\overline{d}_{t-1}, \overline{\mathbf{X}}_t=\overline{\mathbf{x}}_t)}, & \overline{d} \ne \overline{d}^* \quad \text{and the unit is a valid control} \\ 0, & \overline{d} \ne \overline{d}^* \quad \text{and the unit is not a valid control} \end{array}\right. } \end{aligned}$$

(6)

Whether or not a unit is a “valid control” depends on (1) the time point t at which causal effects are desired and (2) the type of control units desired: never-treated units or not-yet-treated units. If using never-treated controls, only units following the \(\mathbf{0}_T\) treatment strategy are valid controls. If using not-yet-treated controls, valid controls comprise the never-treated units and the units who have not yet initiated treatment by time t.

It is useful to make explicit the parallels between single time point ATT weights and the ATT weights described here:

In the single time point setting, treated units receive a weight of 1. Here, units following the treatment strategy of interest receive a weight of 1. The mean of \(Y_t\) among these units estimates \(E[Y_t(\overline{d}^*) \mid \overline{D}=\overline{d}^*]\).

In the single time point setting, control units receive a weight equal to the conditional odds of treatment. That is, if \(p(\mathbf{x})\) represents the propensity score, control units receive weights equal to \(p(\mathbf{x})/(1-p(\mathbf{x}))\). The \(1-p(\mathbf{x})\) in the denominator reweights control units to the full sample; it effectively simulates all units in the sample receiving the control condition. The \(p(\mathbf{x})\) in the numerator scales the weight back down to have just the treated units receiving the control condition, allowing for estimation of the treatment effect for just the treated units. Here, the denominator of the weights for valid controls effectively simulates all units receiving each of the control treatment strategies. The numerator scales the weight back down to have just the treated units following the control strategies. The weighted mean of \(Y_t\) among valid controls estimates \(E[Y_t(\overline{d}) \mid \overline{D}=\overline{d}^*]\).

Normalized weights can be formed by normalizing the weights for control units to sum to the total number of treated units.

Similar to ATE weights, use of ATT weights creates a pseudopopulation in which treatment is independent of confounders. Unlike the ATE weights, the pseudopopulation created by the ATT weights cannot be used to estimate the marginal mean potential outcome under any treatment strategy; it can only be used to estimate the mean potential outcome under the treatment strategy for the treated group (\(\overline{d}^*\)) and under the treatment strategies followed by valid controls among the units who followed the treated strategy \(\overline{d}^*\).

Fitting MSMs at multiple time points parallels estimation goals in DiD settings, in particular, the estimation of group-time ATTs in the CS2021 framework. Estimating ATT(g, t) in the TVT (MSM) framework involves constructing TVT ATT weights for treatment strategy \(\overline{d}^* = (\mathbf{0}_{g-1}, \mathbf{1}_{T-g+1})\) and fitting the MSM \(E[Y_t] = \beta _0 + \beta _1 d_g\) by weighted least squares, so that \(\beta _1\) estimates ATT(g, t).
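To make the connection concrete, the following Python sketch (a hypothetical three-period staggered-adoption data-generating process with variable names of my own choosing; a sketch rather than a definitive implementation) constructs unnormalized TVT ATT weights for the group first treated in period 2, uses never-treated units as controls, and forms the weighted contrast estimating ATT(2, 3). Fitting the MSM above by weighted least squares with these weights gives essentially the same contrast.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical staggered-adoption DGP: T = 3, no units treated in period 1,
# treatment is absorbing, and a time-varying covariate affects both
# treatment initiation and the outcome.
rng = np.random.default_rng(1)
n = 10_000
X2 = rng.normal(size=n)
D2 = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * X2))))            # initiate in period 2
X3 = 0.5 * X2 + 0.5 * D2 + rng.normal(size=n)                          # affected by treatment
D3 = np.where(D2 == 1, 1,                                              # absorbing treatment
              rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * X3)))))   # initiate in period 3
Y3 = 1.0 + X2 + X3 + 1.5 * D2 + 1.0 * D3 + rng.normal(size=n)
df = pd.DataFrame({"X2": X2, "D2": D2, "X3": X3, "D3": D3, "Y3": Y3})

# Treatment-initiation models among the not-yet-treated.
h2 = smf.logit("D2 ~ X2", data=df).fit(disp=0).predict(df)                 # P(D2 = 1 | X2)
h3 = smf.logit("D3 ~ X3", data=df[df["D2"] == 0]).fit(disp=0).predict(df)  # P(D3 = 1 | D2 = 0, X3)

# ATT weights for ATT(g = 2, t = 3) with never-treated controls.
treated = (df["D2"] == 1).to_numpy()                       # group g = 2, strategy (0, 1, 1)
never = ((df["D2"] == 0) & (df["D3"] == 0)).to_numpy()     # never-treated, strategy (0, 0, 0)

w = np.zeros(n)
w[treated] = 1.0
# Controls: numerator is the probability of following the treated strategy
# (initiate at time 2, then remain treated with probability 1); denominator
# is the probability of following their own never-treated strategy.
w[never] = (h2[never] / (1 - h2[never])) * (1.0 / (1 - h3[never]))
# Units first treated in period 3 are not valid never-treated controls (weight 0).

# Weighted contrast; np.average normalizes the control weights.
att_2_3 = df.loc[treated, "Y3"].mean() - np.average(df.loc[never, "Y3"], weights=w[never])
print(att_2_3)  # in this DGP the truth is 3.0 (1.5 + 1.0 direct, plus 0.5 mediated through X3)
```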

2.4.3 Weights for treatment initiation (absorbing treatments)

A slight modification of the ATE and ATT weights is needed when estimating the effect of initiating a treatment. This detail is relevant in DiD settings that consider the effect of an absorbing treatment. Once a unit is treated (treatment “switches on”), the conditional probability of treatment given treatment and covariate history is 1. That is, if a unit is first treated at time g, the terms in the products in (5) and (6) become 1 for \(t > g\). For example, the treatment strategy \(\overline{d} = (0,1,1,1)\) in a 4-time period setting is effectively treated as the strategy \(\overline{d} = (0,1)\) in constructing IP weights. In the remainder of this article, I will refer to the weights described above as TVT IP weights.
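As a small illustration, a helper along the following lines (hypothetical function and argument names) makes this truncation explicit: once a unit's treatment has switched on, no further terms contribute to the denominator of its IP weight.

```python
def absorbing_ip_denominator(d, p_initiate):
    """Denominator of a unit's IP weight under an absorbing treatment.

    d          : 0/1 treatment indicators over time (nondecreasing once treated).
    p_initiate : fitted P(D_t = 1 | not yet treated, covariate history), one per period.
    Terms for periods after treatment initiation equal 1, so they are skipped.
    """
    denom = 1.0
    for t in range(len(d)):
        if t > 0 and d[t - 1] == 1:   # already treated: P(D_t = d_t | history) = 1
            break
        denom *= p_initiate[t] if d[t] == 1 else 1.0 - p_initiate[t]
    return denom

# The strategy (0, 1, 1, 1) contributes only its first two periods,
# mirroring the (0, 1) example above.
print(absorbing_ip_denominator([0, 1, 1, 1], [0.3, 0.4, 0.5, 0.6]))  # (1 - 0.3) * 0.4
```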

2.5 Comparison of TVT and DiD approaches to IPW

Given that IPW has a long history in the DiD literature (Abadie 2005; Callaway and Sant’Anna 2021; Sant’Anna and Zhao 2020; Stuart et al. 2014), it is useful to compare the details of IPW implementation in these frameworks and in the TVT framework.

The form of the TVT IP weights clarifies why these weights can handle time-varying confounding. Recall that the sequential exchangeability assumption can be interpreted as an assumption of a conditionally randomized experiment at each time point. Each term in the product form of the TVT IP weights reflects randomization at a single time point, where randomization can be influenced by any type of confounder: time-fixed or time-varying. Thus, application of the TVT IP weights creates a pseudopopulation in which treatment is marginally independent of the confounders and past treatment at each time point. This allows for direct comparison of mean outcomes under different treatment strategies in the weighted sample. In terms of the causal graph, IP weighting removes arrows from confounders to treatment at each time point, thus disabling noncausal paths.

In contrast, IPW strategies in the DiD literature have primarily used time-invariant (often pre-treatment) covariates in propensity score estimation. Weighting therefore creates a conditionally randomized experiment at the first time point, but not at subsequent time points, and is thus unable to handle time-varying confounding.

Another key difference between TVT and DiD strategies for IPW is the quantity that is weighted. Abadie (2005); Callaway and Sant’Anna (2021); Sant’Anna and Zhao (2020) provide estimators that weight outcome change scores (e.g., \(Y_t - Y_{t-1}\)) in addition to estimators that weight the outcomes themselves. The use of change score estimators has efficiency benefits (Sant’Anna and Zhao 2020) but requires the availability of panel data. Using the change score has the benefit of removing the effect of certain types of unmeasured confounders (confounders that do not evolve differently over time in the treatment and control groups and do not have a time-varying effect on the outcome). Because the TVT literature has focused on treatment effects at a single final time point, this useful feature of the change score seems to have been overlooked in TVT methodology. It is natural to consider combining TVT IP weighting with the construction of change scores. I explore this in my simulation studies.
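As a rough sketch of what this combination could look like (hypothetical column and argument names; the weights are assumed to have been constructed as in Sect. 2.4.2), the only change relative to the level-based contrast is that a pre-treatment outcome is subtracted before weighting:

```python
import numpy as np
import pandas as pd

def att_change_score(panel: pd.DataFrame, w: np.ndarray,
                     treated: np.ndarray, control: np.ndarray) -> float:
    """ATT-style contrast that applies TVT ATT weights to outcome change scores.

    panel   : data frame with a pre-treatment outcome 'Y_pre' and final outcome 'Y_post'.
    w       : TVT ATT weights (1 for the treated group, odds-type weights for valid controls).
    treated : boolean mask for units following the treated strategy of interest.
    control : boolean mask for valid control units.
    """
    # Differencing removes unit-specific, time-invariant components of the outcome.
    dy = panel["Y_post"] - panel["Y_pre"]
    return dy[treated].mean() - np.average(dy[control], weights=w[control])
```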
