Search code examples
rfileparallel-processingxlsx

How to read excel workbooks with varying sheet names in parallel?


I am trying to use parLapplyLB to read excel files

file.list <- list.files(path="folder path", pattern="*.xlsx", full.names=TRUE)

test <- parLapplyLB(cl, file.list, function(x){
    readxl::read_excel(x, sheet=sheet_in_file)
    }

My problem is that each file in file.list has a different "sheet" name that I need to read. For e.g. in file1, I need to read sheet = "abc", in file2, need to read sheet = "xyz" etc. I tried few things but I have not been able to achieve this, is there a way to do it using parLapplyLB?


Solution

  • parLapplyLB as parallelized variant of lapply essentially loops over one list. What you want is a multivariate variant which allows you to loop over the 1st, 2nd, 3rd, etc. element in both of your two lists, where the single-threaded version would be Map (with a big M). The parallelized version mcMap lives in the parallel package you are using as well.

    We may use parallel::mcMap.

    library(parallel)
    r <- mcMap(openxlsx::read.xlsx, file.list, sheet_in_file)
    r
    # r$`./foo/wb1.xlsx`
    # X1 X2 X3
    # 1  1  1  1
    # 2  1  1  1
    # 
    # $`./foo/wb2.xlsx`
    # X1 X2 X3
    # 1  2  2  2
    # 2  2  2  2
    # 
    # $`./foo/wb3.xlsx`
    # X1 X2 X3
    # 1  3  3  3
    # 2  3  3  3
    # 
    # $`./foo/wb4.xlsx`
    # X1 X2 X3
    # 1  4  4  4
    # 2  4  4  4
    

    (Should work similarly with readxl::read_excel, don't have it installed on this machine.)

    Also see @HenrikB's comment below for a solution using a different package.


    Data:

    dir.create('foo')  ## create dir
    ## create four lists of four dataframes, write to dir as .xlsx
    lapply(1:4, \(j) openxlsx::write.xlsx(lapply(1:4, \(i) data.frame(matrix(i, 2, 3))) |> setNames(LETTERS[1:4]), 
                                          file=sprintf('./foo/wb%s.xlsx', j), overwrite=TRUE)) |>invisible()
    
    file.list <- list.files(path = "./foo", pattern="*.xlsx", full.names=TRUE)
    
    sheet_in_file <- c('A', 'B', 'C', 'D')