Search code examples
rcsvmergezipread.csv

Among many subfolders, find CSVs starting with a given string and within ZIPs or not, and merge them while adding their names in a new column with R


In a folder (path = "D:/DataLogs/), I have several subfolders. Inside these subfolders, I would like to retrieve all the csv starting only with "QCLog" and merge them (rbind) into a single data.frame, while creating a first new column including the full path/name of these QCLog csv.

The two difficulties are:

  1. In some subfolders, there may be csv (starting with QCLog or not) directly accessible and others located in several zip files (examples within the green rectangles below).
  2. All CSVs have the same number of columns (n=66) and the same headers. However, some headers contain accents or symbols, almost every second column (one in two) has the exact same header name, and some columns are empty (no NAs).

Is this feasible?
Thanks for help

enter image description here


Solution

  • The following code, although a bit long, works fine and fast:

    library(tidyverse)
    library(data.table)
    
    # paths
    main_path <- "D:/DataLogs/" # target directory
    zip_path <- "D:/DataLogs/Zip_files/" # subfolder where the zips will be unzipped
    
    # specific string of csv file name
    str_csv <- "QCLog"
    
    # list csv
    csv_nozip_list <- list.files(main_path, paste0(str_csv, ".*\\.csv"), recursive = TRUE, full.names = TRUE) # list of QCLog csv that ARE NOT within zip
    zip_list <- list.files(main_path, "\\.zip", recursive = TRUE, full.names = TRUE) # list of all csv -including QCLog- that ARE within zip
    unzip_files <- lapply(zip_list, unzip, exdir = zip_path) # unzip all CSV in the zip_path subfolder
    csv_zip_list <- list.files(zip_path, paste0(str_csv, ".*\\.csv"), recursive = TRUE, full.names = TRUE) # list only QCLog that have been decompressed in zip_path
    
    # read matching files while adding a new filename column
    csv_nozip_read <- lapply(csv_nozip_list, function(x) {data.table(fread(x), "filename" = x)})
    csv_zip_read <- lapply(csv_zip_list, function(x) {data.table(fread(x), "filename" = x)})
    
    # rbind lists
    csv_nozip_rbind <- data.table::rbindlist(csv_nozip_read)
    csv_zip_rbind <- data.table::rbindlist(csv_zip_read)
    
    # rbind selected csv
    dat <- rbind(csv_nozip_rbind, csv_zip_rbind) |>
      distinct() |> # if needed
      relocate(filename)