Search code examples
rxlsxrbind

How to rbind many xlsx files with start and end


I have several files in one place. I merge them like this:

punkty_pgd <- read_xlsx("C:/Users/Desktop/PC- 2021042500.xlsx", sheet = "NEW")
punkty_pgd_2 <- read_xlsx("C:/Users/Desktop/PC- 2021042512.xlsx", sheet = "NEW")
punkty_pgd_3 <- read_xlsx("C:/Users/Desktop/PC- 2021042600.xlsx", sheet = "NEW")
punkty_pgd <- rbind(punkty_pgd, punkty_pgd_2, punkty_pgd_3)

There are dates in their names and I would like to limit this merging somehow, that is, only merge files that have a date range in the name: from date_start to date_end


Solution

  • For the "load many files and process them", a good place to start is https://stackoverflow.com/a/24376207/3358227.

    Lacking that, let's address the filename/date thing.

    First, assuming there are files you don't want to load (this time), then we should first get the names of the files.

    files <- list.files("c:/Users/Desktop", pattern = "\\.xlsx$", full.names = TRUE)
    ### add a fake file without a date, for testing
    files <- c(files, "Something.xlsx")
    
    files
    # [1] "C:/Users/Desktop/PC- 2021042500.xlsx"
    # [2] "C:/Users/Desktop/PC- 2021042512.xlsx"
    # [3] "C:/Users/Desktop/PC- 2021042600.xlsx"
    # [4] "Something.xlsx"                      
    

    Now we can determine what the Date is by parsing the filenames.

    files <- cbind(strcapture("\\b([0-9]{10})\\b", files, list(fn = "")), origfn = files)
    files
    #           fn                               origfn
    # 1 2021042500 C:/Users/Desktop/PC- 2021042500.xlsx
    # 2 2021042512 C:/Users/Desktop/PC- 2021042512.xlsx
    # 3 2021042600 C:/Users/Desktop/PC- 2021042600.xlsx
    # 4       <NA>                       Something.xlsx
    

    (strcapture returns a data.frame, and I'll keep it in there for convenience.)

    files$date <- as.Date(files$fn, format = "%Y%m%d%H")
    files
    #           fn                               origfn       date
    # 1 2021042500 C:/Users/Desktop/PC- 2021042500.xlsx 2021-04-25
    # 2 2021042512 C:/Users/Desktop/PC- 2021042512.xlsx 2021-04-25
    # 3 2021042600 C:/Users/Desktop/PC- 2021042600.xlsx 2021-04-26
    # 4       <NA>                       Something.xlsx       <NA>
    

    Now we can filter just the files we need for a specific date range.

    • base R:

      dates <- as.Date(c("2021-04-20", "2021-04-25"))
      loadthese <- subset(files, dates[1] <= date & date <= dates[2])
      loadthese
      #           fn                               origfn       date
      # 1 2021042500 C:/Users/Desktop/PC- 2021042500.xlsx 2021-04-25
      # 2 2021042512 C:/Users/Desktop/PC- 2021042512.xlsx 2021-04-25
      list_of_frames <- lapply(loadthese$origfn, readxl::read_xlsx, sheet = "NEW")
      punkty_pgd <- do.call(rbind, list_of_frames)
      
    • tidyverse

      dates <- as.Date(c("2021-04-20", "2021-04-25"))
      punkty_pgd <- files %>%
        filter(between(date, dates[1], dates[2])) %>%
        pull(origfn) %>%
        map_dfr(~ read_xlsx(., sheet = "NEW"))