Search code examples
rregressiondummy-variablepanel-dataplm

Is it ok to run a plm fixed effect model and add a factor dummy variable (tree way fixed effects)?


Is it ok to run a "plm" fixed effect model and add a factor dummy variable in R as below?

The three factors "Time", "Firm” and "Country" are all separate indices which I want to fix all together.

Instead of making two indices in total by combining "Firm” and "Country", I find the below specification works much better for my case.

Is this an acceptable format?

plm(y ~ lag(x1, 1) + x2 + x3 + x4 + x5 + factor(Country), data=DATA,
    index=c("Firm","Time"), model="within")

Solution

  • It is okay to add additional factors. We can prove this by calculating an LSDV model. As a preliminary note, you will of course need robust standard errors, usually clustered at the highest aggregate level, i.e. country in this case.

    Note: R >= 4.1 is used in the following.

    LSDV

    fit1 <- 
      lm(y ~ d + x1 + x2 + x3 + x4 + factor(id) + factor(time) + factor(country), 
         dat)
    lmtest::coeftest(
      fit1, vcov.=sandwich::vcovCL(fit1, cluster=dat$country, type='HC0')) |>
      {\(.) .[!grepl('\\(|factor', rownames(.)), ]}()
    #      Estimate Std. Error    t value      Pr(>|t|)
    # d  10.1398727  0.3181993 31.8664223 4.518874e-191
    # x1  1.1217514  1.6509390  0.6794627  4.968995e-01
    # x2  3.4913273  2.7782157  1.2566797  2.089718e-01
    # x3  0.6257981  3.3162148  0.1887085  8.503346e-01
    # x4  0.1942742  0.8998307  0.2159008  8.290804e-01
    

    After adding factor(country), the estimators we get with plm::plm are identical to LSDV:

    plm::plm

    fit2 <- plm::plm(y ~ d + x1 + x2 + x3 + x4 + factor(country), 
                     index=c('id', 'time'), model='within', effect='twoways', dat)
    summary(fit2, vcov=plm::vcovHC(fit2, cluster='group', type='HC1'))$coe
    #      Estimate Std. Error    t-value      Pr(>|t|)
    # d  10.1398727  0.3232850 31.3651179 5.836597e-186
    # x1  1.1217514  1.9440165  0.5770277  5.639660e-01
    # x2  3.4913273  3.2646905  1.0694206  2.849701e-01
    # x3  0.6257981  3.1189939  0.2006410  8.409935e-01
    # x4  0.1942742  0.9250759  0.2100089  8.336756e-01
    

    However, cluster='group' will refer to "id" and not to "country", so the standard errors are wrong. It seems that clustering by the additional factor with plm is currently not possible, at least I am not aware of anything.

    Alternatively you may use lfe::felm to not have to do without the immensely reduced computing times relative to LSDV:

    lfe::felm

    summary(lfe::felm(y ~ d + x1 + x2 + x3 + x4 | id + time + country | 0 | country,
                      dat))$coe
    #      Estimate Cluster s.e.    t value     Pr(>|t|)
    # d  10.1398727    0.3184067 31.8456637 1.826374e-33
    # x1  1.1217514    1.6520151  0.6790201 5.004554e-01
    # x2  3.4913273    2.7800267  1.2558611 2.153737e-01
    # x3  0.6257981    3.3183765  0.1885856 8.512296e-01
    # x4  0.1942742    0.9004173  0.2157602 8.301083e-01
    

    For comparison, here is what Stata computes, the standard errors closely resemble those of LSDV and lfe::felm:

    Stata

    . reghdfe y d x1 x2 x3 x4, absorb (country time id) vce(cluster country) 
    
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               d |   10.13987   .3185313    31.83   0.000      9.49907    10.78068
              x1 |   1.121751   1.652662     0.68   0.501    -2.202975    4.446478
              x2 |   3.491327   2.781115     1.26   0.216    -2.103554    9.086209
              x3 |   .6257981   3.319675     0.19   0.851    -6.052528    7.304124
              x4 |   .1942742   .9007698     0.22   0.830    -1.617841    2.006389
           _cons |   14.26801   23.65769     0.60   0.549    -33.32511    61.86114
    

    Simulated Panel Data:

    n1 <- 20; t1 <- 4; n2 <- 48
    dat <- expand.grid(id=1:n1, time=1:t1, country=1:n2)
    set.seed(42)
    dat <- within(dat, {
      id <- as.vector(apply(matrix(1:(n1*n2), n1), 2, rep, t1))
      d <- runif(nrow(dat), 70, 80)
      x1 <- sample(0:1, nrow(dat), replace=TRUE)
      x2 <- runif(nrow(dat))
      x3 <- runif(nrow(dat))
      x4 <- rnorm(nrow(dat))
      y <-
        10*d +  ## treatment effect
        as.vector(replicate(n2, rep(runif(n1, 2, 5), t1))) +  ## id FE
        rep(runif(n1, 10, 12), each=t1) +  ## time FE
        rep(runif(n2, 10, 12), each=n1*t1) +  ## country FE
        - .7*x1 + 1.3*x2 + 2.4*x3 +
        .5 * x4 + rnorm(nrow(dat), 0, 50)
    })
    readstata13::save.dta13(dat, 'panel.dta')  ## for Stata