r lm

Optimize the time to run fixed effects in an R lm() model

I am trying to run a regression model that includes fixed effects for cities in the united states. I have over 10,000,000 million rows and 600 cities. The code below works, but it is really slow. When including a factor for a variable with lots of levels, is there any way to run the model faster.

x <- data.frame(
    a = sample( 1:1000, 1000000 , replace=T),
    cityfips = sample( 1:250, 1000000 , replace=T),
    d = sample( 1:4, 1000000 , replace=T)
)

system.time(a1 <- lm( a~cityfips+d  , x ) )
system.time(a2 <- lm( a~as.factor(cityfips) + d  , x ) )




> system.time(a1 <- lm( a~cityfips+d  , x ) )
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(a2 <- lm( a~as.factor(cityfips) + d  , x ) )
   user  system elapsed 
  95.65    0.97   96.62 
> system.time(a3 <- slm( a~as.factor(cityfips) + d  , x ) )
   user  system elapsed 
   4.58    2.06    6.65

Solution

When you have that many factors, constructing the model.matrix in lm() will take up most of the time, one way is to use sparseMatrix like in glmnet and there are two packages, sparseM, MatrixModels that allows lm onto sparseMatrix:

set.seed(111)
x <- data.frame(
    a = sample( 1:1000, 1000000 , replace=T),
    cityfips = sample( 1:250, 1000000 , replace=T),
    d = sample( 1:4, 1000000 , replace=T)
)

library(SparseM)
library(MatrixModels)
library(Matrix)

system.time(f_lm <- lm( a~as.factor(cityfips) + d  , x )  )
   user  system elapsed 
 75.720   2.494  79.365  
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d  , x ))
   user  system elapsed 
  5.373   3.952  10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d  ,data=x,sparse=TRUE))
   user  system elapsed 
  1.878   0.335   2.219

The closest I can find is glm4 in MatrixModels, but you can see below the coefficients are the same as the fit using lm:

all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE

One other option besides glm4 in MatrixModels is to use lm.fit (as pointed out by @BenBolker:

lm.fit(x=Matrix::sparse.model.matrix(~as.factor(cityfips) + d,data=x),y=x$a)

This gives you a list as like lm.fit() normally and you cannot apply functions such as summary() etc.

Authors of both package warn about it being experimental so there might still be some differences compared to stats::lm , take care to check.