
Read large number of small files efficiently in R


I have about 700K small files (Condor log files, each less than 10 KB). There are no rules for the filenames. I am using list.files to obtain all the filenames, then reading each file with readLines and merging the results into a list.

Currently it takes several hours to read all the files. This is the code I use to read the log files:

rm(list = ls())

base <- 'logs-025'
exts <- c('log', 'out', 'err')

for (i in seq_along(exts))
{
    # Escape the dot and anchor the extension so only real '.log'/'.out'/'.err' files match
    all_files <- list.files(base, paste0('apsim_.*\\.', exts[i], '$'), full.names = TRUE)
    # Preallocate the result list instead of growing it inside the loop
    res <- vector('list', length(all_files))
    for (j in seq_along(all_files))
    {
        res[[j]] <- readLines(all_files[j])
    }
    save(res, file = paste0(Sys.info()['nodename'], '-', exts[i], '.RData'))
}

Is there an efficient way to read a large number of small files in R?

Thanks for any advice.

Cheers, Bangyou


Solution

  • Depending on the total size of the data set (i.e. whether it will fit in memory), you might want to memory-map the files (for example with the ff package); an in-R shortcut in the same spirit is sketched after this list.

    But in general the performance of R's I/O functions is poor, and I would recommend writing those loops in C (see the compiled-code sketch below).
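
Short of full memory mapping, one pure-R shortcut worth trying first (a minimal sketch, not part of the original answer) is to slurp each file in a single readChar() call and split it into lines afterwards, which avoids the per-line overhead of readLines():

# Read a whole file in one call, then split it into lines.
# Assumes the logs are plain text small enough to fit in memory.
read_whole <- function(path) {
    txt <- readChar(path, file.info(path)$size, useBytes = TRUE)
    strsplit(txt, '\r?\n')[[1]]
}

all_files <- list.files('logs-025', 'apsim_.*\\.log$', full.names = TRUE)
res <- lapply(all_files, read_whole)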
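
And if you do drop to compiled code, the loop does not have to be plain C: here is a minimal sketch in C++ via the Rcpp package (assumed installed; slurp_files is a name made up for this example) that reads each file in one shot and returns the contents to R:

library(Rcpp)

# Each file is read into a single string; splitting into lines
# can then be done back in R with strsplit().
cppFunction('
CharacterVector slurp_files(CharacterVector paths) {
    int n = paths.size();
    CharacterVector out(n);
    for (int i = 0; i < n; i++) {
        // Stream the whole file into a string buffer in one pass
        std::ifstream in(Rcpp::as<std::string>(paths[i]).c_str());
        std::stringstream ss;
        ss << in.rdbuf();
        out[i] = ss.str();
    }
    return out;
}', includes = '#include <fstream>\n#include <sstream>')

contents <- slurp_files(all_files)
res <- strsplit(contents, '\r?\n')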