r

Filter CSV files for specific value before importing


I have a folder with thousands of comma-delimited CSV files, totaling dozens of GB. Each file contains many records, which I'd like to split up and process separately based on the value in the first field (for example, aa, bb, cc, etc.).

Currently, I'm importing all the files into a dataframe and then subsetting in R into smaller, individual dataframes. The problem is that this is very memory intensive - I'd like to filter the first column during the import process, not once all the data is in memory.

This is my current code:

setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, fread, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
DF <- rbindlist(temp)
DFaa <- subset(DF, V1 =="aa")

If possible, I'd like to move the "subset" process into lapply.

Thanks


Solution

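  • library(data.table)
    setwd("E:/Data/")
    # pattern is a regular expression, not a glob, so match a literal ".csv" at the end of the name
    files <- list.files(path = "E:/Data/", pattern = "\\.csv$")
    # Read each file and keep only the rows where V1 == "aa" before combining
    temp <- lapply(files, function(x) subset(fread(x, sep=",", fill=TRUE, integer64="numeric", header=FALSE), V1=="aa"))
    DF <- rbindlist(temp)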
    

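    Untested, but this should work: replace the bare fread in your lapply call with an anonymous function that reads a file and subsets it right away. Each file is still read in full, but only the matching rows are kept, so at most one whole file is in memory at a time on top of the rows collected so far.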
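
    The same idea can be written as a small named helper. This is only a sketch: read_matching, wanted and the throwaway demo files are made up for illustration and are not part of the original question. It assumes the data.table package is installed, and it uses data.table's own row filter instead of subset().

```r
library(data.table)

# Hypothetical helper: read one file, then keep only the rows whose
# first field (V1, since header = FALSE) is one of the wanted values.
read_matching <- function(path, wanted) {
  dt <- fread(path, sep = ",", fill = TRUE, integer64 = "numeric", header = FALSE)
  dt[V1 %in% wanted]
}

# Small demo on throwaway files so the sketch is self-contained.
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("aa,1,x", "bb,2,y", "aa,3,z"), file.path(demo_dir, "one.csv"))
writeLines(c("cc,4,u", "aa,5,v"), file.path(demo_dir, "two.csv"))

# full.names = TRUE returns complete paths, so no setwd() is needed.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Each file is filtered as soon as it is read; only matches are combined.
DFaa <- rbindlist(lapply(files, read_matching, wanted = "aa"))
print(DFaa)
```

    To process the other values (bb, cc, and so on) the helper could be called once per value, at the cost of reading every file again each time, or given a vector such as wanted = c("aa", "bb") if the combined rows fit in memory.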