r, excel, tidyverse, apache-arrow

Is it possible to delete the first few rows of xlsx files (over 100 files) with multiple sheets in R?


I have a series of xlsx files (> 200 MB each) with multiple sheets. Only the first sheet of each file contains an introduction, something like:

This table is designed for balabala etc... balabala
Reference Key date
1 01/01/1999

The number of introduction lines differs from file to file, but all the datasets start with the Reference Key variable.

Is it possible to avoid reading the whole dataset just to delete the introductions, and then merge the sheets from the same file into one xlsx file?


Solution

  • Here's a quick process for converting a set of xlsx files, each with one or more sheets, into a directory tree of CSV files. For each sheet, the loop peeks at the first few rows, looks for a line that starts with "Reference Key" and, if found, skips down to that row; if not found, it skips nothing and assumes that readxl::read_excel will guess appropriately.

    files <- list.files(pattern = "xlsx$", full.names = TRUE)
    for (fn in files) {
      # one output directory per workbook, named after the file (sans extension)
      dirnm <- tools::file_path_sans_ext(fn)
      dir.create(dirnm, showWarnings = FALSE)
      for (sht in readxl::excel_sheets(fn)) {
        # peek at the first few rows (as text) to find the "Reference Key" header;
        # increase n_max if an introduction might be longer than four rows
        dat <- readxl::read_excel(fn, sht, n_max = 4, col_types = "text")
        skip <- grep("Reference Key", dat[[1]])[1]
        if (is.na(skip)) skip <- 0L  # no header found: assume there is no introduction
        newname <- file.path(dirnm, paste0(sht, ".csv"))
        # re-read the sheet, skipping the introduction, and write it out as CSV
        readxl::read_excel(fn, sht, skip = skip) |>
          write.csv(newname, row.names = FALSE)
      }
    }
    

    This works for me given two files:

    • Book1.xlsx with sheets Sheet1 and Sheet2;
    • Book2.xlsx with sheets Sheet1 and Sheet2.

    After this, we have subdirectories containing the CSV files:

    files
    # [1] "./Book1.xlsx" "./Book2.xlsx"
    list.files(pattern = "csv$", recursive = TRUE, full.names = TRUE)
    # [1] "./Book1/Sheet1.csv" "./Book1/Sheet2.csv" "./Book2/Sheet1.csv" "./Book2/Sheet2.csv"
    

    This should work well for whatever your purpose is. If you're dealing with large amounts of data, there are many reasons why you may prefer to write directly to parquet format instead of CSV:

    • lazy reading: in a dplyr pipe, you can use a somewhat-reduced set of mutate, filter, select, and such, and none of the data is read into memory until you finally %>% collect() the data, similar to dbplyr and dtplyr;
    • combine files: if the schemas (columns/types) are all the same, then you can use arrow::open_dataset once with all (or a subset of) the files, and it will virtually combine them, optionally using hive-partitioning (not used here, but it can be added if applicable; see the sketch after this list);
    • native types: the classes (and attributes) of the R data.frame produced by read_excel are preserved in the parquet file, so if your Excel data includes dates, timestamps, etc., you can set those classes before saving to parquet, and when you read the parquet files back they will have the correct classes again.
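
    As a brief aside on the hive-partitioning point: here is a minimal sketch, not part of the main workflow, that uses arrow::write_dataset in place of one parquet file per sheet. The sheet column is hypothetical and is added only so that each sheet lands in its own partition directory:

    readxl::read_excel(fn, sht, skip = skip) |>
      # "sheet" is a hypothetical column added purely to illustrate partitioning
      dplyr::mutate(sheet = sht) |>
      # writes dirnm/sheet=<sheet name>/part-0.parquet (hive-style layout)
      arrow::write_dataset(dirnm, format = "parquet", partitioning = "sheet")

    arrow::open_dataset(dirnm) would then read all partitions of that workbook back as a single dataset.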

    For that, I think I would modify the innermost portion of the loop to be something like:

        newname <- file.path(dirnm, paste0(sht, ".pq"))
        # requires dplyr for %>% and mutate(); "somedate"/"somestamp" and the
        # format strings are placeholders for your actual columns and formats
        readxl::read_excel(fn, sht, skip = skip) %>%
          mutate(
            thedate = as.Date(somedate, format = "....."),
            thetime = as.POSIXct(somestamp, format = ".....")
          ) %>%
          arrow::write_parquet(newname)
    

    and then use arrow::open_dataset on each file (if desired), or something like this if the schemas are all the same:

    ds <- list.files(tools::file_path_sans_ext(files), pattern = "pq$", full.names = TRUE) |>
      arrow::open_dataset()
    

    and have lazy access to all of the data in one object.
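
    As a minimal sketch of that lazy access (the column names "Reference Key" and date follow the preview in the question and may differ in the real files), nothing is materialized until collect():

    library(dplyr)
    ds |>
      # these verbs are translated into an arrow query; no data is read yet
      filter(!is.na(`Reference Key`)) |>
      select(`Reference Key`, date) |>
      # only collect() pulls the filtered, selected rows into memory
      collect()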