Search code examples
rforeachparallel-processingdoparallellibgee

How do I loop through a list of formulas in GEE model in R?


I am trying to run a set of GEE models. However, each model takes about an hour to an hour and a half to run. This wouldnt normally be an issue, but given the amount of models i am trying to run, I am currently trying to implement parallel processing of these GEE models in hopes I can cut down on the overall run time.

In my attempts to do so however, I am facing an error that I believe is related to the gee model function I am using itself, geeglm() from the geepack package. The error states:

Incompatible types (from language to character) in subassignment fix

Based on this other post, it appears that the error is coming from the formula is text and would be fixed if I just typed the formula outside of "" as normal: geeglm giving error "incompatible types (from language to character) in subassignment type fix"

however I am trying to loop through a list of formulas. Other posts I've seen involve looping through a list of variables in a glm, not necessarily the entire formula.

Here is my code:

library(parallel)
library(doParallel)
library(tidyverse)
library(geepack)
dt.model <- dt_long_cat

# Declare number of cluster to use
## Warning: more does NOT mean faster!!! As the objects and data sets need to be copied to each cluster!!
cl<-makeCluster(10)

# All packages must be loaded in the parallel environment
clusterEvalQ(cl, {
   library(tidyverse)
   library(geepack)
 })

# export all objects to parallel environment - this is the slowest step
## if you have a large dataset they need to be exported to each cluster
clusterExport(cl
              , varlist=c("dt.model")
              , envir=environment())


models.to.run <-
  list("ga.m1"="log(ga) ~ exposure1 + mat_age+ par + season + bmi_cat + tobacco_use"
       ,"ga.m2"="log(ga) ~ exposure2 + mat_age+ par + season + bmi_cat + tobacco_use"
       ,"ga.m3"="log(ga) ~ exposure3 + mat_age+ par + season + bmi_cat + tobacco_use"
       ,"ga.me1"="log(ga) ~ exposure1 + mat_age+ par + season + bmi_cat + tobacco_use + education"
       ,"ga.me2"="log(ga) ~ exposure2 + mat_age+ par + season + bmi_cat + tobacco_use + education"
       ,"ga.me3"="log(ga) ~ exposure3 + mat_age+ par + season + bmi_cat + tobacco_use + education")

# here you are looping to run the models using multiple cores
out.list <-
  foreach(i=1:length(models.to.run)) %dopar% {
    models.to.run[[i]] %>%
    geeglm(formula=.,data=dt.model,id=cohort,family=gaussian,corstr="exch")  
  }


# Always remember to stopCluster - skipping this makes things very complicated
stopCluster(cl=cl)

out.list


Solution

  • You can perfectly list formulae.

    library(geepack)
    
    data(dietox)  ## load example data from `geepack` package
    
    ## make list of formulae
    mf_lst <- list(fo1=Weight ~ Cu + Time, 
                   fo2=Weight ~ Cu + Time + I(Time^2),
                   fo3=Weight ~ Cu + Time + I(Time^2) + I(Time^3))
    
    
    library(parallel)
    
    cl <- makeCluster(detectCores() - 1)
    clusterExport(cl, c('dietox', 'mf_lst'))
    clusterEvalQ(cl, library(geepack))
    
    res <- parLapply(cl, mf_lst, \(x) geeglm(x, data=dietox, id=Pig, family=gaussian(), corstr="exch"))
    
    stopCluster(cl)
    

    Gives

    lapply(res, \(x) coef(summary(x)))
    # $fo1
    #             Estimate Std.err     Wald Pr(>|W|)
    # (Intercept)   15.422  1.0250  226.373    0.000
    # CuCu035       -0.835  1.5643    0.285    0.593
    # CuCu175        1.773  1.8766    0.893    0.345
    # Time           6.943  0.0796 7604.531    0.000
    # 
    # $fo2
    #             Estimate Std.err    Wald Pr(>|W|)
    # (Intercept)   18.500  1.0238 326.525 0.00e+00
    # CuCu035       -0.850  1.5633   0.296 5.87e-01
    # CuCu175        1.766  1.8752   0.887 3.46e-01
    # Time           5.624  0.1914 863.411 0.00e+00
    # I(Time^2)      0.102  0.0131  60.010 9.44e-15
    # 
    # $fo3
    #             Estimate Std.err   Wald Pr(>|W|)
    # (Intercept)  21.1275 1.03471 416.93 0.00e+00
    # CuCu035      -0.8425 1.56393   0.29 5.90e-01
    # CuCu175       1.7698 1.87586   0.89 3.45e-01
    # Time          3.5876 0.30194 141.18 0.00e+00
    # I(Time^2)     0.4786 0.04854  97.24 0.00e+00
    # I(Time^3)    -0.0194 0.00242  64.01 1.22e-15