Search code examples
rxlsx

Fast way to read xlsx files into R


It is a follow-up question to this one. What is the fastest way to read .xlsx files into R ?

I use library(xlsx) to read in data from 36 .xlsx files. It works. However, the problem is that this is very time consuming (well over 30 minutes), especially when considering the data in each file is not that large (matrix of size 3*3652 in each file). To this end, is there a better to deal with such problem, please? Is there another quick way to read .xlsx into R? Or can I put the 36 files into a single csv file quickly and then read into R?

Moreover, I just realised that readxl cannot write xlsx. Is there a counterpart of it to deal with writing instead of reading?

"Response to those voted this question down":

This question is about fact instead of the so-called "opinionated answers and spam" because speed is time and time is fact but NOT opinion.

Further update:

Perhaps one can explain to us in plain language why some method works much faster than others. I am certainly confused about this.


Solution

  • Here is a small benchmark test. Results: readxl::read_xlsx on average about twice as fast as openxlsx::read.xlsx across different number of rows (n) and columns (p) using standard settings.

    enter image description here

    options(scipen=999)  # no scientific number format
    
    nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
    pp <- c(1, 5, 10, 20, 30, 40, 50)
    
    # create some excel files
    l <- list()  # save results
    tmp_dir <- tempdir()
    
    for (n in nn) {
      for (p in pp) {
        name <-  
        cat("\n\tn:", n, "p:", p)
        flush.console()
        m <- matrix(rnorm(n*p), n, p)
        file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")
    
        # write
        write.xlsx(m, file)
    
        # read
        elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
        df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p, 
                         elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
        l <- append(l, list(df))
    
        elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
        df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p, 
                         elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
        l <- append(l, list(df))
    
      }
    }
    
    # results 
    d <- do.call(rbind, l)
    
    library(ggplot2)
    
    ggplot(d, aes(n, elapsed, color= fun)) + 
      geom_line() + geom_point() +  
      facet_wrap( ~ paste("columns:", p)) +
      xlab("Number of rows") +
      ylab("Seconds")