
Combining 100 RDS/RData files into one large file - large file too big


I have 100 .rds files, each approximately 2510 KB in size, and would like to bind them all together by row into one large data file.

So far I am using this:

    memory.limit(size = 1500000000)

    library(data.table)
    files <- list.files(path = "mypath", pattern = "\\.rds$", full.names = TRUE)
    dat_list <- lapply(files, function(x) data.table(readRDS(x)))
    all <- do.call("rbind", dat_list)

This seems to work, but when running the final line I get a "cannot allocate vector of size..." error, which I understand means the combined object I am trying to create is too large to fit in memory.

As you can see, I have tried increasing the memory limit in R, but this does not help. Is there any way to get around this? I have read about methods of combining csv files outside of R, so that R's memory is not affected - is there a similar method that can be used here?
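
For illustration, one such out-of-memory approach is to append each file's rows to a single on-disk CSV as each file is read, so that only one chunk is held in memory at a time. A minimal sketch, assuming each .rds file contains a rectangular data frame and using a hypothetical output name combined.csv:

    library(data.table)

    files <- list.files(path = "mypath", pattern = "\\.rds$", full.names = TRUE)
    out <- "combined.csv"
    if (file.exists(out)) file.remove(out) # start from a clean output file

    for (f in files) {
        # read one chunk, append it to the CSV on disk, then free it before the next
        fwrite(data.table(readRDS(f)), out, append = file.exists(out))
        gc()
    }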

I intend to convert this to a file-mapped big.matrix object later, if that helps. I also have the same files in RData format.
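
If the end goal is a file-backed big.matrix, the combined CSV from the sketch above can be memory-mapped directly with the bigmemory package, so the full table never has to fit in RAM at once. A sketch under the assumption that all columns are numeric (a big.matrix stores a single type), with hypothetical backing-file names:

    library(bigmemory)

    # builds the matrix on disk and memory-maps it
    big <- read.big.matrix("combined.csv", header = TRUE, type = "double",
                           backingfile = "combined.bin",
                           descriptorfile = "combined.desc")

    # later sessions can re-attach without re-reading the CSV
    big <- attach.big.matrix("combined.desc")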

Would appreciate any help anyone can offer!


Solution

  • Update: using the newer purrr::map_df() function, which combines map and bind_rows and returns a data frame

    https://purrr.tidyverse.org/reference/map.html

    library(tidyverse)
    my_files <- list.files(pattern = "\\.rds$")
    my_all <- map_df(my_files, read_rds)
    

    ...

    The dplyr::bind_rows() function is explicitly documented as an efficient implementation of the common pattern do.call(rbind, dfs) for binding many data frames into one.

    https://dplyr.tidyverse.org/reference/bind.html

    library(tidyverse)
    write_rds(iris, "iris1.rds") # write three sample files
    write_rds(iris, "iris2.rds")
    write_rds(iris, "iris3.rds")
    my_files <- list.files(pattern = "\\.rds$")
    dat_list <- lapply(my_files, read_rds) # switched to only read_rds()
    my_all <- do.call("bind_rows", dat_list) # switched to bind_rows()
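
    If bind_rows() still exhausts memory, data.table::rbindlist() is another commonly used option; it is generally faster and more memory-efficient than do.call("rbind", ...) because it allocates the combined table in a single pass. A minimal sketch over the same sample files:

    library(data.table)

    my_files <- list.files(pattern = "\\.rds$")
    my_all <- rbindlist(lapply(my_files, readRDS)) # pre-allocates the result once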