I have a series of xlsx files (> 200 MB each), each with multiple sheets. Only the first sheet of each file contains an introduction, something like:

| This table is designed for balabala etc... | balabala |
|---|---|
| Reference Key | date |
| 1 | 01/01/1999 |

The number of introduction lines differs from file to file, but every dataset starts with the Reference Key variable.
Is it possible to avoid reading the whole dataset and deleting the introduction manually, and then merge the sheets from the same file into one xlsx file?
Here's a quick process for converting a set of xlsx files, each with one or more sheets, into a directory tree of CSV files. For each sheet, the loop peeks at the first few rows looking for a line that starts with "Reference Key": if found, it skips down to that row; if not found, it skips nothing, assuming that readxl::read_excel will guess appropriately.
files <- list.files(pattern = "xlsx$", full.names = TRUE)
for (fn in files) {
  dirnm <- tools::file_path_sans_ext(fn)
  dir.create(dirnm, showWarnings = FALSE)
  for (sht in readxl::excel_sheets(fn)) {
    # peek at the first few rows (as text) to locate the header row;
    # increase n_max if your introductions can be longer than this
    dat <- readxl::read_excel(fn, sht, n_max = 4, col_types = "text")
    skip <- grep("Reference Key", dat[[1]])[1]
    if (is.na(skip)) skip <- 0L
    newname <- file.path(dirnm, paste0(sht, ".csv"))
    readxl::read_excel(fn, sht, skip = skip) |>
      write.csv(newname, row.names = FALSE)
  }
}
This works for me given two files: Book1.xlsx with sheets Sheet1 and Sheet2, and Book2.xlsx with sheets Sheet1 and Sheet2. After this, we have subdirectories with CSV files:
files
# [1] "./Book1.xlsx" "./Book2.xlsx"
list.files(pattern = "csv$", recursive = TRUE, full.names = TRUE)
# [1] "./Book1/Sheet1.csv" "./Book1/Sheet2.csv" "./Book2/Sheet1.csv" "./Book2/Sheet2.csv"
This should work well for whatever your purpose is. If you're dealing with large amounts of data, there are several reasons you may prefer to write directly to parquet format instead of CSV:

- You get lazy, dplyr-like access to the data: mutate, filter, select, and such, and none of the data is read into memory until you finally %>% collect() it, similar to dbplyr and dtplyr.
- You can call arrow::open_dataset once with all (or a subset of) the files, and it will virtually combine them, optionally using hive-partitioning (not used here necessarily, but it can be added if applicable).
- The column classes of the data.frame produced by read_excel are preserved in the parquet file, so if your excel data includes dates, timestamps, etc., you can set those classes before saving to parquet, and when you read the parquet files back they will have the correct classes again.

For that, I think I would modify the inner-most portion of the loop to be something like:
newname <- file.path(dirnm, paste0(sht, ".pq"))
readxl::read_excel(fn, sht, skip = skip) %>%
  mutate(
    thedata = as.Date(somedate, format = "....."),
    thetime = as.POSIXct(somestamp, format = ".....")
  ) %>%
  arrow::write_parquet(newname)
and then use arrow::open_dataset on each file (if desired) or something like this if the schemas are all the same:
ds <- list.files(pattern = "pq$", recursive = TRUE, full.names = TRUE) |>
  arrow::open_dataset()
and have lazy access to all of the data in one object. (Note: full.names = TRUE is needed here so that open_dataset receives usable paths to the files in the subdirectories.)
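As a sketch of what that lazy access looks like (the `Reference Key` and `date` column names are taken from the sample data in the question; adjust to your actual columns):

```r
library(dplyr)

ds %>%
  filter(`Reference Key` > 100) %>%   # filtering is pushed down to the parquet scan
  select(`Reference Key`, date) %>%   # only these columns are materialized
  collect()                           # only now is data actually read into memory
```

Everything before collect() builds a query plan; arrow reads only the row groups and columns it needs, which is what makes this practical for data that won't fit in memory.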