I have about 700K small files (Condor log files, each less than 10 KB). There are no naming rules for the files. I am using list.files() to obtain all the filenames, then reading each file with readLines() and merging the results into a list. Currently it takes several hours to read all the files. Here is the code I use to read the log files:
rm(list = ls())
base <- 'logs-025'
exts <- c('log', 'out', 'err')
for (i in seq_along(exts))
{
    # escape the dot so it matches a literal '.' before the extension
    all_files <- list.files(base, paste0('apsim_.*\\.', exts[i]), full.names = TRUE)
    # preallocate the result list instead of growing it inside the loop
    res <- vector('list', length(all_files))
    for (j in seq_along(all_files))
    {
        res[[j]] <- readLines(all_files[j])
    }
    save(res, file = paste0(Sys.info()['nodename'], '-', exts[i], '.RData'))
}
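The inner loop above can also be written as a single lapply() call, which avoids the bookkeeping of growing a list by hand. A minimal sketch (read_all is a hypothetical helper name; the pattern and directory layout are taken from the question):

```r
## Sketch: read every matching file in a directory into a list of
## character vectors, one element per file.
read_all <- function(base, ext) {
    # escaped dot so it matches a literal '.' before the extension
    all_files <- list.files(base, paste0('apsim_.*\\.', ext), full.names = TRUE)
    lapply(all_files, readLines)  # returns a list, one element per file
}
```

This will not by itself turn hours into minutes (the per-file open/read cost dominates), but it is the idiomatic starting point before trying lower-level optimisations.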
Is there an efficient way to read a large number of small files in R? Thanks for any advice.
Cheers, Bangyou
Depending on the total size of the data set (i.e. whether it will fit in memory), you might want to memory-map the files (for example with the ff package).
But in general the performance of R's I/O functions is poor, and I would recommend writing those loops in C.
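Before dropping to C, a middle ground worth trying in plain R is to read each file in one gulp with readChar() and split it into lines in memory, rather than letting readLines() process the file line by line. A hedged sketch (slurp is a hypothetical helper name; behaviour for files with unusual encodings is not handled here):

```r
## Sketch: read a whole small file in a single call, then split into lines.
## file.size() gives the byte count, which readChar() reads in one go.
slurp <- function(path) {
    txt <- readChar(path, file.size(path), useBytes = TRUE)
    strsplit(txt, '\r?\n')[[1]]  # split on Unix or Windows line endings
}
```

For thousands of tiny files this usually reduces the per-file overhead, since each file costs one read call instead of many; whether it helps enough here depends on the filesystem, so benchmark on a sample of the 700K files first.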