
readLines taking up too much memory


I'm using readLines to read in a bunch of very large .csvs with a ; delimiter using the below code:

library(magrittr)  # provides %>%
library(stringr)   # provides str_split

read_p <- function(flnm) {
  readLines(flnm) %>% str_split(pattern = "; ")
}

dat <- vector("list", length(files))
for (i in seq_along(files)) {
  dat[[i]] <- read_p(files[[i]])
}

Where files is a vector of filenames. The code itself runs fairly quickly, but the result takes up around 4GB of memory in R, whereas the files only take up ~500MB on disk - is there something I missed when reading them in to avoid this? I need to use readLines as there are no headers (so it's not really a CSV) and each line has a different length/number of columns.

Thanks for any help!


Solution

  • Operating off of test.csv from your previous (since-deleted) question, there is a stark difference in object size after conversion to numeric.

    For the record, the file looks like

    996; 1160.32; 1774.51; 4321.05; 2530.97; 2817.63; 1796.18; ...
    1008; 1774.51; 1796.18; 1192.42; 1285.69; 1225.96; 2229.92; ...
    1020; 1796.18; 1285.69; 711.67; 1761.44; 1016.74; 1671.90; ...
    1032; 1285.69; 1761.44; 1016.74; 1671.90; 725.51; 2466.49; ...
    1044; 1761.44; 1016.74; 725.51; 2466.49; 661.82; 1378.85; ...
    1056; 1761.44; 1016.74; 2466.49; 661.82; 1378.85; 972.94; ...
    1068; 2466.49; 661.82; 1378.85; 972.94; 2259.46; 3648.49; ...
    1080; 2466.49; 1378.85; 972.94; 2259.46; 1287.72; 1074.63; ...
    

    though the real test.csv has 751 lines of text, and each line has between 10001 and 10017 ;-delimited fields. The (unabridged) file is just under 64 MiB.

    Reading it in, parsing it, and then converting to numbers has a dramatic effect on its object sizes:

    object.size(aa1 <- readLines("test.csv"))
    # Warning in readLines("test.csv") :
    #   incomplete final line found on 'test.csv'
    # 67063368 bytes
    
    object.size(aa2 <- strsplit(aa1, "[; ]+"))
    # 476021832 bytes
    
    object.size(aa3 <- lapply(aa2, as.numeric))
    # 60171040 bytes
    

    and we end up with:

    length(aa3)
    # [1] 751
    
    str(aa3[1:4])
    # List of 4
    #  $ : num [1:10006] 996 1160 1775 4321 2531 ...
    #  $ : num [1:10008] 1008 1775 1796 1192 1286 ...
    #  $ : num [1:10009] 1020 1796 1286 712 1761 ...
    #  $ : num [1:10012] 1032 1286 1761 1017 1672 ...
    

    So reading it into full-length strings isn't what explodes the memory; it's splitting each line into 10000+ fields that does you in. This is because each character object carries a comparatively large overhead:

    ### vec-1, nchar-0
    object.size("")
    # 112 bytes
    
    ### vec-5, nchar-0
    object.size(c("","","","",""))
    # 152 bytes
    
    ### vec-5, nchar-10
    object.size(c("aaaaaaaaaa","aaaaaaaaaa","aaaaaaaaaa","aaaaaaaaaa","aaaaaaaaaa"))
    # 160 bytes
    

    If we look at the original data, we'll see this exploding:

    object.size(aa1[1])   # whole lines at a time, so one line is 10000+ characters
    # 89312 bytes
    object.size(aa2[[1]]) # vector of 10000+ strings, each between 3-8 characters
    # 633160 bytes
    

    But fortunately, numbers are much smaller in memory:

    object.size(1)
    # 56 bytes
    object.size(c(1,2,3,4,5))
    # 96 bytes
    

    and it scales much better - enough to reduce the data from 453 MiB (split, strings) to 57 MiB (split, numbers) in R's storage.
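
    To see that scaling at the sizes involved here, compare the same 10000 values stored as strings versus as doubles (a quick sketch; exact byte counts vary by platform and R version):

    x <- sprintf("%.2f", runif(10000, 500, 5000))  # 10000 short, mostly distinct strings
    object.size(x)              # several hundred KB: one string object per element
    object.size(as.numeric(x))  # ~80 KB: 8 bytes per double plus a small vector header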


    You will still see a bloom in R's memory usage while reading these files in. You can try to reduce the peak by converting to numbers immediately after strsplit; to be honest, I don't know how quickly R's garbage collector (a common feature of high-level programming languages) will return the memory, nor am I certain how this behaves in light of R's "global string pool" (a quick illustration of that pool follows the function below). But if you are interested, you can try this adaptation of your function.

    func <- function(path) {
      aa1 <- readLines(path)
      # split each line on ";" (and surrounding spaces), then convert to numeric
      # immediately, so the intermediate string vector is eligible for collection
      aa2 <- lapply(aa1, function(st) as.numeric(strsplit(st, "[; ]+")[[1]]))
      aa2
    }
    

    (I make no promises that it will not still "bloom" your memory usage, but perhaps it's a little less-bad.)
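
    (As for the "global string pool" mentioned above: R interns strings, so repeated identical strings share one underlying object, and object.size generally reflects that sharing. A small, platform-dependent illustration:)

    object.size(rep("1761.44", 10000))       # one shared string: close to the size of a bare 10000-element vector
    object.size(sprintf("%07.2f", 1:10000))  # 10000 distinct strings: several times larger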

    And then the canonical replacement for your for loop (though that loop is fine) is

    dat <- lapply(files, func)
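
    If the memory bloom is still a problem, one untested variation keeps your loop, processes one file at a time, and calls gc() between files; gc() is rarely necessary (R collects automatically), but it can encourage freed memory to be returned sooner:

    dat <- vector("list", length(files))
    for (i in seq_along(files)) {
      dat[[i]] <- func(files[[i]])
      gc()  # optional nudge to the garbage collector between files
    }
    names(dat) <- basename(files)  # optional: label each result with its source file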