Search code examples
rr-haven

How to read multiple large sas data files into R, filter rows and save subset datasets as .rds


I have 30 sas-files (dataset1.sas7bdat through dataset30.sas7bdat, approx. 10 GB per file) in a folder, and need to analyse a subset of rows in these data files (all rows where the character variable A begins with 10). Thus, I need to read each of the sas-files into R, filter a subset with grep on variable A and then save each of these filtered datasets as a .rds-file. I'm trying to achieve this using a for loop of list.files() and the Haven package to read the sas-file. In order to avoid going out-of-memory, I need to remove the imported dataset on each iteration after the subset has been filtered and saved as .rds.

Though not elegant nor satisfying, I could hard-code it manually 30 times over like this, copy/pasting and incrementing the suffixes by 1 each time:

dt1 <- haven::read_sas("~/folder/dataset1.sas7bdat")
dt1 <- data.table::as.data.table(dt1)
dt1 <- dt1[grep("^10", A)]
saveRDS(dt1, "~/folder/subset1.rds")

dt2 <- haven::read_sas("~/folder/dataset2.sas7bdat")
dt2 <- data.table::as.data.table(dt2)
dt2 <- dt1[grep("^10", A)]
saveRDS(dt2, "~/folder/subset2.rds")
etc.

While the following for loop technically works to read the files into memory, it is never going to finish due to massively going out of memory, so it does not allow me to filter the data:

folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in 1:length(file_list)) {
 assign(file_list[i], Haven::read_sas(paste(folder, file_list[i], sep='')))
}

Is there a way to - on each iteration in the loop - filter the dataset, remove the unfiltered dataset and save the subset in a .rds-file? I can't seem to come up with a way to incorporate this into my approach of using the assign() function. Is there a better way to go about this?


Solution

  • I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):

    library(haven)
    library(data.table)
    fromFolder <- "~/folder_with_input_data/"
    toFolder <- "~/folder_with_output_data/"
    import_sas <- function(filename) {
    dt <- read_sas(paste(fromFolder, filename, sep=''), NULL)
    dt <- as.data.table(dt)
    dt <- dt[grep("^10", A)]
    saveRDS(dt, paste(toFolder,filename,'.rds.', sep =''), compress = FALSE)
    remove(dt)
    }
    
    file_list <- list.files(path = fromFolder, pattern="^dataset")
    for (filename in file_list) {
    import_sas(filename)
    }
    

    I haven't tested this with the full 30 files yet. I'll do that tonight. If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.

    Update: It worked without a hitch and completed the 297GB conversion in around 13 hours. I don't think it can be optimized to accomplish the task much faster; the vast majority of computing time is spent on opening the sas-files, which I don't think can be done faster by other means than Haven. Unless someone has an idea to optimize the process, this question can be closed.