I want to use R to estimate a regression with a very large number of fixed effects.
I then want to use that regression to predict on a test data set.
However, this needs to be done very quickly because I want to bootstrap my standard errors and do this many times.
I know the lfe package in R can do this. For example:
reg=felm(Y~1|F1 + F2,data=dat)
where dat is the data and F1, F2 are columns of categorical variables (the fixed effects to be included).
predict(reg, dat2), however, does not work with the lfe package, as has been discussed here.
Unfortunately, lm is too slow, as I have a very large number of fixed effects.
The way to speed this up is to extract the coefficients and perform the matrix operations manually. E.g.:
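# Simulated training and test predictors
xtrain <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
xtest <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
y <- -(1:1000)
fit <- lm(y ~ x1 + x2 + x3, data=xtrain)
# Coefficients as a 1 x 4 row vector: intercept, x1, x2, x3
beta <- matrix(coefficients(fit), nrow=1)
# Test design matrix with a leading column of ones, transposed to 4 x 1000
xtest_mat <- t(data.matrix(cbind(intercept=1, xtest)))
# A single matrix product yields all 1000 predictions
predictions <- as.vector(beta %*% xtest_mat)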
library(microbenchmark)
microbenchmark(as.vector(beta %*% xtest_mat),
predict(fit, newdata = xtest))
Unit: microseconds
                          expr     min       lq      mean  median      uq      max neval cld
 as.vector(beta %*% xtest_mat)   8.140  10.0690  13.12173  12.372  15.852   26.292   100  a
 predict(fit, newdata = xtest) 635.413 657.2515 745.94840 673.009 763.166 2363.065   100   b
So you can see that direct matrix multiplication is roughly 50x faster than the predict function (median of about 12 vs. 673 microseconds).
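The example above uses only numeric predictors, where data.matrix gives the right design matrix. With categorical fixed effects, data.matrix would turn each factor into its integer codes, so the product would no longer match the fitted model. A minimal sketch of one way around this is to store each fixed effect as a vector of per-level estimates and look them up by level, which also avoids building a large dummy matrix for the test set. The names train, test and effect_for are made up for illustration, and lm stands in for felm so the example is self-contained. With lfe, the per-level estimates would presumably come from getfe() instead, which is an assumption not shown here.

```r
set.seed(42)
n <- 500

# Simulated data with one numeric regressor and two categorical fixed effects
train <- data.frame(
  x  = rnorm(n),
  F1 = factor(sample(letters[1:20], n, replace = TRUE)),
  F2 = factor(sample(LETTERS[1:10], n, replace = TRUE))
)
train$y <- 1.5 * train$x + as.integer(train$F1) -
  0.5 * as.integer(train$F2) + rnorm(n)

# The test factors must share the training levels so the integer codes line up
test <- data.frame(
  x  = rnorm(100),
  F1 = factor(sample(letters[1:20], 100, replace = TRUE), levels = levels(train$F1)),
  F2 = factor(sample(LETTERS[1:10], 100, replace = TRUE), levels = levels(train$F2))
)

fit <- lm(y ~ x + F1 + F2, data = train)
b <- coef(fit)

# Hypothetical helper: one estimate per factor level, in level order.
# The reference level has no coefficient of its own, so its effect is 0.
effect_for <- function(b, prefix, lev) {
  out <- b[paste0(prefix, lev)]
  out[is.na(out)] <- 0
  unname(out)
}
a1 <- effect_for(b, "F1", levels(train$F1))
a2 <- effect_for(b, "F2", levels(train$F2))

# Prediction = intercept + slope * x + looked-up fixed effects
manual_pred <- b[["(Intercept)"]] + b[["x"]] * test$x +
  a1[as.integer(test$F1)] + a2[as.integer(test$F2)]

# Check against predict() on the same fitted model
all.equal(manual_pred, unname(predict(fit, newdata = test)))
```

Inside a bootstrap loop only the lookup and the sum need to be repeated for each refit, so the cost per prediction stays linear in the number of test rows regardless of how many levels the fixed effects have.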