I am trying to run a basic regression model in R. Previously, I always used the lm() function without any issues. However, my data frame now seems to be too large for this function and my computer. After lm() had run on my dataset for 30 minutes without any visible progress, I stopped it, which crashed RStudio. The computer I am using has 24 GB of RAM.
My regression model is:
lm(y~var1+var2+var3+var4, data = df)
The data I am trying to run the lm() function on has n=100000 observations and 4 independent variables (one numeric, three factor), and is normally distributed.
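For reference, a minimal sketch of a dataset with roughly the shape described above, which can be used to time lm() on a given machine. The simulated values, the two-level factors, and the names df_sim, fit_sim and timing are assumptions for illustration only, not the actual data.

```r
# Hypothetical simulated data: 100,000 rows, three factors, one numeric predictor.
set.seed(42)
n <- 100000
df_sim <- data.frame(
  var1 = factor(sample(c("x1", "x2"), n, replace = TRUE)),
  var2 = factor(sample(c("factor1", "factor2"), n, replace = TRUE)),
  var3 = factor(sample(c("factorx", "factory"), n, replace = TRUE)),
  var4 = rnorm(n)
)
# Assumed data-generating process, chosen only so that there is something to fit.
df_sim$y <- 1 + 0.5 * df_sim$var4 + rnorm(n)

# Time the fit and check how many columns the design matrix has.
timing <- system.time(
  fit_sim <- lm(y ~ var1 + var2 + var3 + var4, data = df_sim)
)
print(timing)
print(dim(model.matrix(fit_sim)))
```

The number of columns in the design matrix is a useful sanity check: every factor adds one column per level beyond the first, so a factor with many levels makes the matrix much wider than the formula suggests.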
I found out that using the glm4() function (from the MatrixModels package) is a lot faster and does not crash R in my case. However, this function does not produce a summary table when calling it:
library(MatrixModels)
fit <- glm4(y~var1+var2+var3+var4, data = df, sparse = TRUE, family = gaussian)
summary(fit)
Length Class Mode
1 glpModel S4
Only calling the coefficients using head(coef(fit)) does work; however, I would prefer a full summary table.
head(coef(fit))
I also saw the topic "Is there a faster lm function", in which the functions lm.fit() and .lm.fit() are discussed, though the syntax and input (a matrix) of these functions differ from those of the other functions. The function speedglm from the speedglm package returns an error in my case. Most topics on alternatives to the lm() and glm() functions are also outdated.
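To illustrate the difference in syntax, here is a minimal sketch of how lm.fit() and .lm.fit() take a design matrix and a response vector instead of a formula and a data frame. The small data frame df_demo and its values are made up for the example; model.matrix() is used to build the matrix that lm() would otherwise construct internally.

```r
# Hypothetical small example data, not the real data set.
set.seed(1)
n <- 1000
df_demo <- data.frame(
  var1 = factor(sample(c("x1", "x2"), n, replace = TRUE)),
  var4 = rnorm(n)
)
df_demo$y <- 2 + 3 * df_demo$var4 + rnorm(n)

# Build the design matrix explicitly (intercept and dummy columns included).
X <- model.matrix(~ var1 + var4, data = df_demo)

# Matrix interface: x and y are passed directly, no formula involved.
fit_mat  <- lm.fit(x = X, y = df_demo$y)
fit_fast <- .lm.fit(X, df_demo$y)   # lower-level variant, unnamed coefficients

# Formula interface for comparison.
fit_lm <- lm(y ~ var1 + var4, data = df_demo)

print(all.equal(fit_mat$coefficients, coef(fit_lm)))
print(all.equal(as.numeric(fit_fast$coefficients), unname(coef(fit_lm))))
```

Both matrix-based functions return a plain list rather than an "lm" object, so summary() does not produce the usual coefficient table for them.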
What is currently the best way to run an lm() on a large dataset?
Apparently, it should not be a problem to run a regression on a dataset of ~100,000 observations.
After receiving helpful comments on the main post, I found that one of the independent variables used in the regression was coded as a character. I found this by using the following command to check the data type of every column in the data frame (df):
str(df)
$ var1 : chr "x1" "x2" "x1" "x1"
$ var2 : Factor w/ 2 levels "factor1" "factor2": 1 1 1 0
$ var3 : Factor w/ 2 levels "factorx" "factory": 0 1 1 0
$ var4 : num 1 8 3 2
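As a complement to reading the str() output by eye, the check can be sketched programmatically. The data frame df_check below is a made-up stand-in with the same column types as above; col_classes, char_cols and n_unique are names chosen for the example.

```r
# Hypothetical stand-in for df with one character column.
df_check <- data.frame(
  var1 = c("x1", "x2", "x1", "x1"),
  var2 = factor(c("factor1", "factor1", "factor1", "factor2")),
  var4 = c(1, 8, 3, 2),
  stringsAsFactors = FALSE
)

# Class of every column.
col_classes <- sapply(df_check, class)
print(col_classes)

# Names of the columns stored as character.
char_cols <- names(df_check)[sapply(df_check, is.character)]
print(char_cols)

# Number of distinct values in each character column.
n_unique <- sapply(df_check[char_cols], function(x) length(unique(x)))
print(n_unique)
```

The count of distinct values is worth a look as well, since a column with very many distinct strings (an ID column, for instance) would expand into a very wide design matrix.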
Changing var1 to a factor variable:
df$var1 <- as.factor(df$var1)
After changing var1 to a factor variable, the regression indeed runs within a few seconds.
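If several columns are affected, the same conversion can be applied to all character columns at once. This is only a sketch on a made-up data frame (df_conv); whether every character column should really become a factor depends on the data.

```r
# Hypothetical data frame with a character column.
df_conv <- data.frame(
  var1 = c("x1", "x2", "x1", "x1"),
  var4 = c(1, 8, 3, 2),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step.
is_char <- sapply(df_conv, is.character)
df_conv[is_char] <- lapply(df_conv[is_char], as.factor)

# Confirm the new types and count the levels of each factor.
str(df_conv)
level_counts <- sapply(Filter(is.factor, df_conv), nlevels)
print(level_counts)
```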