Tags: r, rbindlist, large-data

Concatenate long list of big matrices efficiently


I have a 4 GB CSV file to load on my 16 GB machine. fread and read.csv can't load it at once; they return memory errors.

So I decided to read the file in chunks, which worked (after an hour or so), and I now have a list of data.frames that takes 2.5 GB according to the Environment tab in RStudio, and 1.2 GB when saved as an RDS.

The issue I have now is concatenating everything back into one big data.frame. From what I understand, rbindlist is the most efficient solution (or is it bind_rows?), but in my case it still uses too much memory.

I think I could work around this by calling rbindlist on the list items n at a time, then repeating on the results until I get a single data.frame, roughly as sketched below. That n would have to be calibrated by hand, though, and the whole process is really ugly (on top of the already annoying CSV import).
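Something like this (untested) is what I have in mind; bind_in_chunks is just a name I made up, and my_chunks stands for my list of data.frames:

    library(data.table)

    # bind the list n elements at a time, then repeat on the results
    # until a single data.table is left; n has to be hand-tuned
    bind_in_chunks <- function(lst, n = 10) {
      while (length(lst) > 1) {
        groups <- ceiling(seq_along(lst) / n)
        lst <- lapply(split(lst, groups), rbindlist)
      }
      lst[[1]]
    }

    big_df <- bind_in_chunks(my_chunks, n = 10)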

Another idea that crossed my mind is to feed an SQLite database from my loaded data and then query it from R (I'll only do subset, min, and max operations on the data), along the lines sketched below.
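Roughly this, I suppose (reusing my_chunks from above; the file, table, and column names are made up):

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")

    # append each chunk to one big table instead of binding in memory
    for (chunk in my_chunks) {
      dbWriteTable(con, "mydata", chunk, append = TRUE)
    }

    # e.g. min/max of a column v1 without loading everything into RAM
    dbGetQuery(con, "SELECT MIN(v1), MAX(v1) FROM mydata")
    dbDisconnect(con)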

Can I do better than this?

My data consists only of integers and doubles, if that makes a difference.


Solution

  • Sounds like bigmemory might have just enough functionality for your problem

    require(bigmemory)
    

    Read files

    You can read files in as a big.matrix with

    read.big.matrix(filename, sep = ",", header = FALSE, col.names = NULL,
        row.names = NULL, has.row.names = FALSE, ignore.row.names = FALSE,
        type = NA, skip = 0, separated = FALSE, backingfile = NULL,
        backingpath = NULL, descriptorfile = NULL, binarydescriptor = FALSE,
        extraCols = NULL, shared = TRUE)
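
    For a file-backed matrix, so the data stay on disk instead of filling RAM, something along these lines should work (the file names here are placeholders, not taken from the question):

    x <- read.big.matrix("my_big_file.csv", sep = ",", header = TRUE,
                         type = "double",
                         backingfile = "my_big_file.bin",
                         descriptorfile = "my_big_file.desc")

    # later sessions can re-attach the same backing file without re-reading the csv
    x <- attach.big.matrix("my_big_file.desc")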
    

    Save memory

    Even with a simple example like iris you can see memory savings (the data behind a big.matrix live outside R's heap, so object.size only measures the small R object that points to them)

    x <- as.big.matrix(iris)
    options(bigmemory.allow.dimnames=TRUE)
    colnames(x) <- c("A", "B", "C", "D", "E")
    
    object.size(x)
    # 664 bytes
    
    object.size(iris)
    # 7088 bytes
    

    Subsetting

    Subsetting big.matrices is limited, but some functionality is provided by mwhich.

    Subset rows where column 1 is <= 5 AND column 2 is <= 4

    x[mwhich(x, 1:2, list(c(5), c(4)), list(c('le'), c('le')), 'AND'),]
    
    #       A   B   C   D E
    # 2   4.9 3.0 1.4 0.2 1
    # 3   4.7 3.2 1.3 0.2 1
    # 4   4.6 3.1 1.5 0.2 1
    # 5   5.0 3.6 1.4 0.2 1
    # etc
    

    NOTE: the result of a subset operation is a regular, in-memory matrix. You can convert a regular matrix to a big.matrix with as.big.matrix()
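
    For example, continuing from the subset above (is.matrix is just a quick check):

    sub <- x[mwhich(x, 1:2, list(5, 4), list('le', 'le'), 'AND'), ]
    is.matrix(sub)        # TRUE: an ordinary in-memory matrix
    sub_big <- as.big.matrix(sub)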

    Min, max, mean, etc

    biganalytics provides more functionality for big.matrices

    require(biganalytics)
    
    colmin(x, cols = 1:2, na.rm = FALSE)
    #   A   B 
    # 4.3 2.0
    
    colmax(x, cols = 1:2, na.rm = FALSE)
    #   A   B 
    # 7.9 4.4 
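
    To cover the "mean" in the heading: biganalytics also has colmean() (plus colrange(), colsd(), and friends), which, as far as I remember, follow the same pattern:

    colmean(x, cols = 1:2, na.rm = FALSE)
    #        A        B 
    # 5.843333 3.057333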
    

    Output

    Finally, you can write a big.matrix back out to a file with

    write.big.matrix(...)
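
    For instance (the output file name is just a placeholder):

    write.big.matrix(x, "iris_out.csv", sep = ",",
                     row.names = FALSE, col.names = TRUE)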