I am writing a function to fit many glm models. To give you an idea of the function, I include a small section of my code below. With the help of several SO users, the function now works for my analysis purposes. However, sometimes, particularly when the sample size is relatively small, it can take quite a long time to finish the whole process.
To reduce the time, I am considering changing some details of the iterative maximization, such as the maximum number of iterations. I have not found a good way to do this, maybe because I am still not familiar with R terminology. Any suggestions on how to do this, or other ways to reduce the running time, would be appreciated.
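The closest thing I could find in base glm is its control argument, e.g. (variable names below are placeholders, and I am not sure whether lowering the iteration cap is actually advisable):

fit <- glm(y ~ x1 + x2, family = binomial, data = mydata,
           control = glm.control(maxit = 10))  # default maxit is 25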
all_glm <- function(crude, xlist, data, family = "binomial", ...) {
  n <- length(xlist)
  # comb_lst: every non-empty combination of the candidate covariates in xlist
  comb_lst <- unlist(lapply(1:n, function(x) combn(xlist, x, simplify = FALSE)), recursive = FALSE)
  # md_lst: formulas for the many models to be fitted (crude model plus each combination)
  md_lst <- lapply(comb_lst, function(x) paste(crude, "+", paste(x, collapse = "+")))
  models <- lapply(md_lst, function(x) glm(as.formula(x), family = family, data = data, ...))
  # OR: exponentiated coefficient of the exposure of interest from each model
  OR <- unlist(lapply(models, function(x) broom::tidy(x, exponentiate = TRUE)$estimate[2]))
  OR
}
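A minimal usage sketch (variable names here are made up for illustration: a binary outcome y, an exposure of interest, and two candidate adjustment covariates):

# hypothetical data for illustration only
set.seed(42)
d <- data.frame(y = rbinom(200, 1, 0.5),
                exposure = rnorm(200),
                age = rnorm(200),
                sex = rbinom(200, 1, 0.5))
ors <- all_glm(crude = "y ~ exposure", xlist = c("age", "sex"), data = d)
ors  # one exposure odds ratio per covariate combination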
EDIT
Thanks to @BenBolker, who directed me to the package fastglm, I ended up with several R packages that provide faster alternatives to glm. I have tried fastglm and speedglm. It appears that both are faster than glm on my machine.
library(fastglm)
library(speedglm)
# from
set.seed(1)
n <- 25000
k <- 500
y <- rbinom(n, size = 1, prob = 0.5)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
df <- data.frame(y, x)
fo <- as.formula(paste("y ~", paste(paste("s", 1:k, sep = ""), collapse = " + ")))
# Fit three models:
system.time(m_glm <- glm(fo, data=df, family = binomial))
system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
system.time(m_fastglm <- fastglm(x, y, family = binomial()))
> system.time(m_glm <- glm(fo, data=df, family = binomial))
user system elapsed
56.51 0.22 58.73
> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
user system elapsed
17.28 0.04 17.55
> system.time(m_fastglm <- fastglm(x, y, family = binomial()))
user system elapsed
23.87 0.09 24.12
The IRLS algorithm typically used for fitting GLMs requires a matrix inversion/decomposition at each iteration. fastglm offers several different options for that decomposition, and the default choice is a slower but more stable one (QR with column pivoting). If your only interest is speed, then one of the two available Cholesky-type decompositions will improve speed dramatically, which is more advisable than just lowering the number of IRLS iterations. Another notable difference between fastglm and standard IRLS implementations is its careful use of half-steps to prevent divergence (IRLS can diverge in practice in a number of cases).
The method argument of fastglm allows one to change the decomposition. Option 2 gives the vanilla Cholesky decomposition and option 3 gives a slightly more stable version of it. On my computer, the timings for your provided example are:
> system.time(m_glm <- glm(fo, data=df, family = binomial))
user system elapsed
23.206 0.429 23.689
> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
user system elapsed
15.448 0.283 15.756
> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 2))
user system elapsed
2.159 0.055 2.218
> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 3))
user system elapsed
2.247 0.065 2.337
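If you want to keep the formula-based workflow from your function, one option (a sketch along the lines of the benchmark above, not something I have tested against your code) is to build the design matrix yourself with model.matrix and pass that to fastglm, using method = 2 for the faster Cholesky decomposition:

library(fastglm)
# build the design matrix (including the intercept column) from the formula
X <- model.matrix(fo, data = df)
m_fastglm2 <- fastglm(X, df$y, family = binomial(), method = 2)
coef(m_fastglm2)[1:5]  # coefficients are ordered as the columns of X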
With regards to using broom with fastglm objects, I can look into that.
Another note about decompositions: when fastglm uses the QR decomposition, it works with the design matrix directly. Although speedglm technically offers a QR decomposition, it works by first computing $X^TX$ and decomposing that, which is numerically less stable than a QR decomposition of $X$ itself.
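To make the stability point concrete: in the 2-norm, the condition number of $X^TX$ is the square of the condition number of $X$, so any ill-conditioning in the design matrix is amplified when you decompose $X^TX$. A small sketch with base R's kappa (the numbers will vary with the random seed):

# condition numbers: kappa of X'X is roughly the square of kappa of X
X <- cbind(1, matrix(rnorm(1000 * 20), 1000, 20))
kappa(X, exact = TRUE)              # condition number of the design matrix
kappa(crossprod(X), exact = TRUE)   # condition number of X'X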