Search code examples
rformuladummy-variable

Formula with interaction terms in event-study designs using R


I am estimating what's often called the "event-study" specification of a difference-in-differences model in R. Basically, we observe treated and control units over time and estimate a two-way fixed effects model with parameters for the "effect" of being treated in each time period (omitting one period, usually the one before treatment, as the reference period). I am struggling with how to compactly specify this model with R formulas.

For example, here is the model...

library(lfe)
library(tidyverse)
library(dummies)

N <- 100

df <- tibble(
    id = rep(1:N, 5),
    treat = id >= ceiling(N / 2),
    time = rep(1:5, each=N),
    x = rnorm(5 * N)
)

# produce an outcome variable
df <- df %>% mutate(
    y = x - treat * (time == 5) + time + rnorm(5*N)
)

head(df)

# easily recover the parameters with the true model...
summary(felm(
    y ~ x + I(treat * (time == 5)) | id + time, data = df
))

Now, I want to do an event-study design using period 4 as the baseline because treatment happens in period 5. We expect coefficients near zero on the pre-periods (1–4), and a negative treatment effect for the treated in the treated period (time == 5)

df$timefac <- factor(df$time, levels = c(4, 1, 2, 3, 5))
summary(felm(
    y ~ x + treat * timefac | id + time, data = df
))

That looks good, but produces lots of NAs because several of the coefficients are absorbed by the unit and time effects. Ideally, I can specify the model without those coefficients...

# create dummy for each time period for treated units
tdum <- dummy(df$time)
df <- bind_cols(df, as.data.frame(tdum))
df <- df %>% mutate_at(vars(time1:time5), ~ . * treat)

# estimate model, manually omitting one dummy
summary(felm(
    y ~ x + time1 + time2 + time3 + time5 | id + time, data = df
))

Now, the question is how to specify this model in a compact way. I thought the following would work, but it produces very unpredictable output...

summary(felm(
     y ~ x + treat:timefac | id + time, data = df
))

With the above, R does not use period 4 as the reference period and sometimes chooses to include the interaction with untreated rather than treated. The output is...

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
x                    0.97198    0.05113  19.009  < 2e-16 ***
treatFALSE:timefac4       NA         NA      NA       NA    
treatTRUE:timefac4  -0.19607    0.28410  -0.690  0.49051    
treatFALSE:timefac1       NA         NA      NA       NA    
treatTRUE:timefac1  -0.07690    0.28572  -0.269  0.78796    
treatFALSE:timefac2       NA         NA      NA       NA    
treatTRUE:timefac2        NA         NA      NA       NA    
treatFALSE:timefac3  0.15525    0.28482   0.545  0.58601    
treatTRUE:timefac3        NA         NA      NA       NA    
treatFALSE:timefac5  0.97340    0.28420   3.425  0.00068 ***
treatTRUE:timefac5        NA         NA      NA       NA    

Is there a way to specify this model without having to manually produce dummies and interactions for treated units for every time period?

If you know Stata, I'm essentially looking for something as easy as:

areg y x i.treat##ib4.time, absorb(id)

(Note how simple it is to tell Stata to treat the variable as categorical — the i prefix —without making dummies for time and also indicate that period 4 should be the base period — the b4 prefix.)


Solution

  • You can redefine the timefac so that untreated observations are coded as the omitted time category.

    df %>% 
      mutate(time = ifelse(treat == 0, 4, time),
             timefac = factor(time, levels = c(4, 1, 2, 3, 5)))
    

    Then, you can use timefac without interactions and get a regression table with no NAs.

    summary(felm(
      y ~ x + timefac | id + time, data = df
    ))
    
    Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
    x          0.98548    0.05028  19.599  < 2e-16 ***
    time_fac1 -0.01335    0.27553  -0.048    0.961    
    time_fac2 -0.10332    0.27661  -0.374    0.709    
    time_fac3  0.24169    0.27575   0.876    0.381    
    time_fac5 -1.16305    0.27557  -4.221 3.03e-05 ***
    

    This idea came from: https://blogs.worldbank.org/impactevaluations/econometrics-sandbox-event-study-designs-co