Search code examples
rforeachparallel-processingmodelinggam

Parallel foreach changes data to NA when running dropterm


I have a GAMLSS model I'm trying to fit to multiple subsets of my data. Each month needs to be analyzed separately, so I'm using a foreach loop to iterate through the months. However, when I parallelize my loop, the results of dropterm all get NA'd. Here's a similar example using built-in data:

library(dplyr)
library(gamlss)
library(MASS)
nCores <- detectCores()
gamlssCl <- makeCluster(nCores)
registerDoParallel(gamlssCl)
test.par <- foreach(s = unique(iris$Species), 
                    .packages = c('dplyr', 'gamlss', 'MASS')) %dopar% {
  species.data <- filter(iris, Species == s)
  model <- gamlss(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Length, 
                  data = species.data, 
                  family = GA)
  var.rank <- dropterm(model, test = 'Chisq') %>%
    mutate(Variable = row.names(.)) %>% 
    arrange(AIC) %>%
    filter(Variable != '<none>')

  var.rank
}
stopCluster(gamlssCl)
test.par
# [[1]]
# Df AIC LRT Pr(Chi)     Variable
# 1 NA  NA  NA      NA Sepal.Length
# 2 NA  NA  NA      NA  Sepal.Width
# 3 NA  NA  NA      NA Petal.Length
# 
# [[2]]
# Df AIC LRT Pr(Chi)     Variable
# 1 NA  NA  NA      NA Sepal.Length
# 2 NA  NA  NA      NA  Sepal.Width
# 3 NA  NA  NA      NA Petal.Length
# 
# [[3]]
# Df AIC LRT Pr(Chi)     Variable
# 1 NA  NA  NA      NA Sepal.Length
# 2 NA  NA  NA      NA  Sepal.Width
# 3 NA  NA  NA      NA Petal.Length

test.serial <- foreach(s = unique(iris$Species)) %do% {
  species.data <- filter(iris, Species == s)
  model <- gamlss(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Length, 
                  data = species.data, 
                  family = GA)
  var.rank <- dropterm(model, test = 'Chisq') %>%
    mutate(Variable = row.names(.)) %>% 
    arrange(AIC) %>%
    filter(Variable != '<none>')

  var.rank
}
test.serial
# [[1]]
# Df       AIC        LRT   Pr(Chi)     Variable
# 1  1 -31.66335 0.06406465 0.8001832  Sepal.Width
# 2  0 -29.72741 0.00000000        NA Petal.Length
# 3  1 -29.43731 2.29010516 0.1302011 Sepal.Length
# 
# [[2]]
# Df      AIC       LRT      Pr(Chi)     Variable
# 1  0 31.03608  0.000000           NA Petal.Length
# 2  1 33.81852  4.782442 2.875132e-02  Sepal.Width
# 3  1 56.00459 26.968510 2.067972e-07 Sepal.Length
# 
# [[3]]
# Df      AIC         LRT      Pr(Chi)     Variable
# 1  1 16.29265  0.08628226 7.689578e-01  Sepal.Width
# 2  0 18.20637  0.00000000           NA Petal.Length
# 3  1 77.14978 60.94341742 5.873901e-15 Sepal.Length

Note: The error doesn't manifest when using glm instead of gamlss


Solution

  • Sorry, no solution yet, but here's a minimal example that illustrate the issue and that doesn't depend on foreach.

    First, do:

    library("gamlss")
    data <- subset(iris, Species == "setosa")
    model <- gamlss(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Length, 
                    data = data, family = GA)
    ## GAMLSS-RS iteration 1: Global Deviance = -37.7274 
    ## GAMLSS-RS iteration 2: Global Deviance = -37.7274
    
    model2 <- dropterm(model, test = "Chisq")
    print(model2)
    ## Single term deletions for
    ## mu
    ## 
    ## Model:
    ## Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Length
    ##              Df     AIC     LRT Pr(Chi)
    ## <none>          -29.727                
    ## Sepal.Length  1 -29.437 2.29011  0.1302
    ## Sepal.Width   1 -31.663 0.06406  0.8002
    ## Petal.Length  0 -29.727 0.00000
    

    and then save the results to file:

    saveRDS(list(model = model, model2 = model2), file = "gamlss.rds")
    

    Then in a fresh R session (R --vanilla), do:

    > library("gamlss")
    Loading required package: splines
    Loading required package: gamlss.data
    Loading required package: gamlss.dist
    Loading required package: MASS
    Loading required package: nlme
    Loading required package: parallel
     **********   GAMLSS Version 5.0-1  ********** 
    For more on GAMLSS look at http://www.gamlss.org/
    Type gamlssNews() to see new features/changes/bug fixes.
    
    > gamlss <- readRDS("gamlss.rds")
    > model <- gamlss$model
    > class(model)
    [1] "gamlss" "gam"    "glm"    "lm"   
    
    > model2 <- dropterm(model, test = "Chisq")
    Model with term  Sepal.Length has failed 
    Model with term  Sepal.Width has failed 
    Model with term  Petal.Length has failed 
    
    > print(model2)
    Single term deletions for
    mu
    
    Model:
    Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Length
                 Df     AIC LRT Pr(Chi)
    <none>          -29.727            
    Sepal.Length                       
    Sepal.Width                        
    Petal.Length
    

    Compare the output of model2 in the fresh R session compared to that of the first session above;

    > all.equal(model2, gamlss$model2)
    [1] "Component “Df”: 'is.NA' value mismatch: 1 in current 4 in target"     
    [2] "Component “AIC”: 'is.NA' value mismatch: 0 in current 3 in target"    
    [3] "Component “LRT”: 'is.NA' value mismatch: 1 in current 4 in target"    
    [4] "Component “Pr(Chi)”: 'is.NA' value mismatch: 2 in current 4 in target"
    

    Something is clearly not correct here.

    I suspect that the model object contains one or more so-called promises that are not preserved correctly when transferred to another R process (as is the case when you use SNOW clusters).

    I would argue this is an issue with the gamlss package itself. The problem appears to be that gamlss object cannot be serialized. I suggest that you report this to the package maintainer. Feel free to use my minimal example in your report.