Predicting QUICKLY using a regression with a large number of fixed effects


I want to use R to estimate a regression with a very large number of fixed effects.

I then want to use that regression to predict on a test data set.

However, this needs to be done very quickly because I want to bootstrap my standard errors and do this many times.

I know the lfe package in R can do this. For example

reg=felm(Y~1|F1 + F2,data=dat)

where dat is the data and F1, F2 are columns of categorical variables (the fixed effects to be included).

However, predict(reg, dat2) does not work with the lfe package, as has been discussed here.

Unfortunately, lm is too slow because I have a very large number of fixed effects.


Solution

  • The way to speed this up is to extract the coefficients and perform the matrix operations manually. E.g.:

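    # Simulated training and test predictors
    xtrain <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
    xtest <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
    y <- -(1:1000)
    fit <- lm(y ~ x1 + x2 + x3, data=xtrain)
    
    # Coefficients as a 1 x p row vector (intercept first)
    beta <- matrix(coefficients(fit), nrow=1)
    # Test design matrix, transposed to p x n, with a leading row of ones for the intercept
    xtest_mat <- t(data.matrix(cbind(intercept=1, xtest)))
    # Predictions are the coefficient row vector times the design matrix
    predictions <- as.vector(beta %*% xtest_mat)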
    
    library(microbenchmark)
    microbenchmark(as.vector(beta %*% xtest_mat),
                   predict(fit, newdata = xtest))
    
    Unit: microseconds
                              expr     min       lq      mean  median      uq      max neval cld
     as.vector(beta %*% xtest_mat)   8.140  10.0690  13.12173  12.372  15.852   26.292   100  a 
     predict(fit, newdata = xtest) 635.413 657.2515 745.94840 673.009 763.166 2363.065   100   b
    

    In this benchmark, direct matrix multiplication is roughly 50x faster than the predict function (a median of about 12 microseconds versus about 673 microseconds).
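
The example above uses only numeric predictors. For categorical fixed effects like F1 and F2 in the question, the same idea could be sketched in base R roughly as follows. This assumes the model is small enough to fit with lm and that every level in the test data also appears in the training data. The toy data, sample sizes and object names are made up for illustration, and this is not the lfe workflow.

```r
# Toy data with two categorical "fixed effects" (illustrative names and sizes)
set.seed(1)
n <- 2000
train <- data.frame(
  F1 = factor(sample(letters[1:10], n, replace = TRUE)),
  F2 = factor(sample(LETTERS[1:5], n, replace = TRUE))
)
train$Y <- rnorm(n) + as.integer(train$F1) - 2 * as.integer(train$F2)

# Test rows reuse the training levels so the dummy columns line up
test <- data.frame(
  F1 = factor(sample(letters[1:10], 500, replace = TRUE), levels = levels(train$F1)),
  F2 = factor(sample(LETTERS[1:5], 500, replace = TRUE), levels = levels(train$F2))
)

fit <- lm(Y ~ F1 + F2, data = train)

# Build the test design matrix with the same terms, levels and contrasts as the fit
tt <- delete.response(terms(fit))
X_test <- model.matrix(tt, data = test,
                       xlev = fit$xlevels,
                       contrasts.arg = fit$contrasts)

# Manual prediction: design matrix times coefficient vector
pred_manual <- drop(X_test %*% coef(fit))

# Compare against predict(); the two should agree up to floating-point error
all.equal(unname(pred_manual), unname(predict(fit, newdata = test)))
```

In a bootstrap loop, only the coefficient vector changes between replications, so X_test can be built once and reused. With a very large number of levels the dense design matrix may become too big to hold in memory; in that case a sparse design matrix (for example via Matrix::sparse.model.matrix) may be worth trying.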