My questions are:
What is the fastest way to read large(ish) .xlsx Excel files into R? 10 to 200 MB xlsx files, with multiple sheets.
Can some kind of parallel processing be used, e.g. each core reading a separate sheet of a multi-sheet Excel file?
Is there any other kind of optimisation that can be performed?
What I have understood (and what I haven't) so far:
readxl
tends to be faster than openxlsx
I am interested in tabular data only; I am not interested in the Excel formatting, nor in charts, text labels or any other kind of data.
I am possibly looking to import into tidyverse tibbles, but not necessarily. I will then need to export the tables into a Microsoft SQL Server.
Some background: I mostly use Python and am totally new to R. Reading large Excel files in Python is painfully slow. I have already seen that R's readxl
is much faster than Python's pandas
(on a 15-sheet xlsx, each sheet with 10,000 rows and 32 columns: 5.6 seconds for readxl vs 33 seconds for pandas), so that's great! I would, however, still like to understand if there is any way to make the import even faster. I can read the files with R, export them to SQL, then continue the rest of my workflow with Python reading from SQL.
I don't think converting to CSV is the best option, especially not when readxl is so much faster than Python anyway; basically converting to csv may easily take longer than the time I'd save by reading from csv rather than excel. Plus, at least with Python (I don't really know enough R to have tested this thoroughly with readxl), inferring data types works much better with xlsx than with csv.
My code (any critique or suggestion is more than welcome):
library(readxl)
library(tidyverse)
library(tictoc)
this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)
tic("readxl")
path <- "myfile.xlsx"
sheetnames <- excel_sheets(path)
mylist <- lapply(excel_sheets(path), read_excel, path = path)
names(mylist) <- sheetnames
toc()
You could try to run it in parallel using the parallel
package, but it is a bit hard to estimate how fast it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB
to go through the excel sheets and read them in parallel using load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
You can use the package microbenchmark
to test how fast certain options are:
library(microbenchmark)
microbenchmark(
lapply = {lapply(sheets, function(sheet) {
read_excel(excel_path, sheet = sheet)
})},
parralel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
In my case, the parallel version is faster:
Unit: milliseconds
expr min lq mean median uq max neval
lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890 10
parralel 58.94018 64.96452 118.5969 71.42688 80.48588 316.9914 10
The test file contains of 6 sheets, each containing this table:
test test1 test3 test4 test5
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
Note:
you can use stopCluster(cl)
to shut down the workers when the process is finished.