Search code examples
rdistributed

Why are standard errors the same in lag-distributed model in R?


I am running a lag-distributed model ordinary least squares in which a set of units are all treated in the same year. I am including a 2 lags and 2 leads to see if there were any "anticipation" or "long-term" effects by generating the same treatment status in the previous and following years. However, when I run the model all of the standard errors are the same. For example:

set.seed(123)

data <- data.frame(y = rnorm(100), id = sort(rep(1:10,10)), years = rep(2000:2009,10))

tr <- ifelse(data$years == 2005 & data$id %in% c(2,3,4,5,6),1,0)
lead.2003 <- ifelse(data$year == 2003 & data$id %in% c(2,3,4,5,6), 1, 0)
lead.2004 <- ifelse(data$year == 2004 & data$id %in% c(2,3,4,5,6), 1, 0)
lag.2006 <- ifelse(data$year == 2006 & data$id %in% c(2,3,4,5,6), 1, 0)
lag.2007 <- ifelse(data$year == 2007 & data$id %in% c(2,3,4,5,6), 1, 0)

data <- cbind(data, tr, lead.2003, lead.2004, lag.2006 , lag.2007)

summary(lm(y ~ tr + lead.2003 + lead.2004 + lag.2006 + lag.2007, data = data))

Any ideas why this is? Thanks.


Solution

  • The variance of your tr/lead/lag variables is the same and they are orthogonal, so they end up splitting the mean square error evenly. The SE is calculated by sigma^2 * (X'X)^-1. Take a look at the inverse diagonal of the covariance matrix ( (X'X)^-1 in the formula ).

    fit <- lm(y ~ tr + lead.2003 + lead.2004 + lag.2006 + lag.2007, data = data)
    
    X <- as.matrix(cbind(1, data[,4:ncol(data)]))
    diag(solve(t(X) %*% X))
    #          1         tr  lead.2003  lead.2004   lag.2006   lag.2007 
    # 0.01333333 0.21333333 0.21333333 0.21333333 0.21333333 0.21333333 
    

    so, when you multiply this by the mean square error (sigma^2 in the formula, we can get this from an anova), and take the square root to get the standard error

    mse <- anova(fit)[[3]][6]            # mean square error
    sqrt(diag(mse * solve(t(X) %*% X)))  # SE
    #         1        tr lead.2003 lead.2004  lag.2006  lag.2007 
    # 0.1059099 0.4236397 0.4236397 0.4236397 0.4236397 0.4236397 
    

    you end up with the same standard errors for them all.

    You would get different standard errors if the variables had different variances, for example, changing lead.2003, and rerunning the same model (the variables are still orthogonal)

    ## changed 2:6 to 3:6
    lead.2003 <- ifelse(data$year == 2003 & data$id %in% 3:6, 1, 0)
    ## ... rerun model ...
    summary(fit)$coefficients[,2]  # standard errors
    # (Intercept)          tr   lead.2003   lead.2004    lag.2006    lag.2007 
    #   0.1048613   0.4220586   0.4689540   0.4220586   0.4220586   0.4220586 
    

    Or, if the variables weren't orthogonal the standard errors would be different.

    ## again changing lead.2003
    lead.2003 <- sample(c(rep(1,5), rep(0,95)), 100)
    ## .. rerun ..
    summary(fit)$coefficients[,2]
    # (Intercept)          tr   lead.2003   lead.2004    lag.2006    lag.2007 
    #   0.1057283   0.4272960   0.4316341   0.4316341   0.4272960   0.4272960