Search code examples
rparallel.foreachdoparallelsnow

Parallel computation in R for saving data over loops


My efforts in applying parallel on the below simple code to save outputs with Openxlsx over multiple loops is failed.

Anyone can help please to convert this code to the parallel mode. This code on real size data (over 50 million observations, takes 13 hours to run). Reducing it even 2 hours is a big job for me.

library(dplyr)
library(readxl)
library(openxlsx)
library(foreach)
library(doParallel)


rawdata <- readxl::read_xlsx("~/Desktop/Book1.xlsx")


TYPE1 <- rawdata %>% filter(TYPE == "A") 
TYPE2 <- rawdata %>% filter(TYPE == "B") 

Split.TYPE1 <- split(TYPE1, TYPE1$Name) 
Split.TYPE2 <- split(TYPE2, TYPE2$Name) 


#--------------------------------- Save the TYPE A reports------------------------------------------------------------------------------
###################################(the foreach lines are coded)
for (nm in names(Split.TYPE1)){
#foreach(nm=1:names(Split.TYPE1), .combine=cbind) %dopar% {
  file<-paste0(nm,".xlsx")
  d1<-as.data.frame(Split.TYPE1[[nm]])

  wb<-createWorkbook(file)
  addWorksheet(wb, "test", gridLines = T)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, file, overwrite = TRUE)
}

# #------------------------------ Save the TYPE B in a folder ----------------------------------
for (dn in names(Split.TYPE2)){
   dnn <- paste0(dn)
  dir.create(dnn)
  sub_Split.TYPE2 <- split(Split.TYPE2[[dn]], Split.TYPE2[[dn]]$Surname)
  for (fn in names(sub_Split.TYPE2)){
    file<-file.path(dnn, paste0(fn,".xlsx"))

    d1<-as.data.frame(sub_Split.TYPE2[[fn]])
    wb<-createWorkbook(file)
    addWorksheet(wb, "test", gridLines = T)
    writeData(wb, sheet = "test", x = d1)
    saveWorkbook(wb, file, overwrite = TRUE)
  }}

Data:

Name    Surname TYPE
John    Greer   A
David   bear    A
Rose    beer    B
Tara    tea     B
Sam     Mac     B
Alan    Glass   B
Brad    Newman  A
Kristen Goodman A
Jessica Goodwin A
Heather Poker   B

Solution

  • Because I don't have your data, I made some small dummy sample.

    The packages I used:

    library(tidyverse)
    library(openxlsx)
    library(foreach)
    library(doParallel)
    

    This part from you and didn't change anything.

    TYPE1 <- rawdata %>% filter(TYPE == "A") 
    TYPE2 <- rawdata %>% filter(TYPE == "B") 
    
    Split.TYPE1 <- split(TYPE1, TYPE1$Name) 
    Split.TYPE2 <- split(TYPE2, TYPE2$Name) 
    

    Define the parallel backend. I'am using 6 cores here.

    cl <- makeCluster(6)
    registerDoParallel(cl)
    

    This is your first loop. Don't forget to add .packages = "openxlsx". This makes sure the package gets also send to the worker. I change a little bit the code, because nm in names(Split.TYPE1) doesn't work for foreach. Maybe there is an easier solution, but I don't know it.

    foreach(nm = 1:length(Split.TYPE1), .combine = cbind, .packages = "openxlsx") %dopar% {
      file <- paste0(names(Split.TYPE1)[nm], ".xlsx")
      d1 <- as.data.frame(Split.TYPE1[[names(Split.TYPE1)[nm]]])
    
      wb <- createWorkbook(file)
      addWorksheet(wb, "test", gridLines = TRUE)
      writeData(wb, sheet = "test", x = d1)
      saveWorkbook(wb, file, overwrite = TRUE)
    }
    

    The second loop. I only used it once in the past and it worked quite well for me. This is how you can make a nested foreach loop. More info here.

    foreach(dn = 1:length(Split.TYPE2)) %:%
      foreach(fn = 1:length(unique(Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)), .packages = "openxlsx") %dopar% {
        dnn <- paste0(names(Split.TYPE2)[dn])
        dir.create(dnn)
        sub_Split.TYPE2 <- split(Split.TYPE2[[names(Split.TYPE2)[dn]]], Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)
    
    
        file <- file.path(dnn, paste0(names(sub_Split.TYPE2)[fn],".xlsx"))
    
        d1 <- as.data.frame(sub_Split.TYPE2[[fn]])
        wb <- createWorkbook(file)
        addWorksheet(wb, "test", gridLines = T)
        writeData(wb, sheet = "test", x = d1)
        saveWorkbook(wb, file, overwrite = TRUE)
      }
    

    And stop the parallel backend.

    stopCluster(cl)
    

    Using your data I get the following folder/file structure for the nested loop:

    - Alan
        - Glass.xlsx
    - Heather
        - Poker.xlsx
    - Rose
        - beer.xlsx
    - Sam
        - Mac.xlsx
    - Tara
        - tea.xlsx