Tags: r, rbindlist, large-data

Concatenate long list of big matrices efficiently


I have a 4 GB CSV file to load on my 16 GB machine. fread and read.csv can't load it at once; they return memory errors.

So I decided to read the file in chunks, which worked (after an hour or so), and I now have a list of data.frames that takes 2.5 GB according to the Environment tab in RStudio, and 1.2 GB when saved as an RDS.

The issue I have now is concatenating everything back into one big data.frame. From what I understand, rbindlist is the most efficient solution (or is it bind_rows?), but in my case it still uses too much memory.

I think I could work around this by calling rbindlist on the list items n at a time, then repeating on the results until I get a single data.frame, roughly as sketched below. That n would have to be calibrated by hand, though, and the whole process is really ugly (on top of the already annoying CSV import).
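Something like this (untested) is what I have in mind; bind_in_chunks is just a name I made up, and my_chunks stands for my list of data.frames:

    library(data.table)

    # bind the list n elements at a time, then repeat on the results
    # until a single data.table is left; n has to be hand-tuned
    bind_in_chunks <- function(lst, n = 10) {
      while (length(lst) > 1) {
        groups <- ceiling(seq_along(lst) / n)
        lst <- lapply(split(lst, groups), rbindlist)
      }
      lst[[1]]
    }

    big_df <- bind_in_chunks(my_chunks, n = 10)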

Another idea that crossed my mind is to feed an SQLite database from my loaded data and then query it from R (I'll only do subset, min, and max operations on the data), along the lines sketched below.
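Roughly this, I suppose (reusing my_chunks from above; the file, table, and column names are made up):

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")

    # append each chunk to one big table instead of binding in memory
    for (chunk in my_chunks) {
      dbWriteTable(con, "mydata", chunk, append = TRUE)
    }

    # e.g. min/max of a column v1 without loading everything into RAM
    dbGetQuery(con, "SELECT MIN(v1), MAX(v1) FROM mydata")
    dbDisconnect(con)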

Can I do better than this?

My data consists only of integers and doubles, if that makes a difference.


Solution

  • Sounds like bigmemory might have just enough functionality for your problem

    require(bigmemory)
    

    Read files

    You can read files in as a big.matrix with

    read.big.matrix(filename, sep = ",", header = FALSE, col.names = NULL,
        row.names = NULL, has.row.names = FALSE, ignore.row.names = FALSE,
        type = NA, skip = 0, separated = FALSE, backingfile = NULL,
        backingpath = NULL, descriptorfile = NULL, binarydescriptor = FALSE,
        extraCols = NULL, shared = TRUE)
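
    For a file-backed matrix, so the data stay on disk instead of filling RAM, something along these lines should work (the file names here are placeholders, not taken from the question):

    x <- read.big.matrix("my_big_file.csv", sep = ",", header = TRUE,
                         type = "double",
                         backingfile = "my_big_file.bin",
                         descriptorfile = "my_big_file.desc")

    # later sessions can re-attach the same backing file without re-reading the csv
    x <- attach.big.matrix("my_big_file.desc")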
    

    Save memory

    Even with a simple example like iris you can see memory savings (the data behind a big.matrix live outside R's heap, so object.size only measures the small R object that points to them)

    x <- as.big.matrix(iris)
    options(bigmemory.allow.dimnames=TRUE)
    colnames(x) <- c("A", "B", "C", "D", "E")
    
    object.size(x)
    # 664 bytes
    
    object.size(iris)
    # 7088 bytes
    

    Subsetting

    Subsetting big.matrices is limited, but some functionality is provided by mwhich.

    Subset rows where column 1 is <= 5 AND column 2 is <= 4

    x[mwhich(x, 1:2, list(c(5), c(4)), list(c('le'), c('le')), 'AND'),]
    
    #       A   B   C   D E
    # 2   4.9 3.0 1.4 0.2 1
    # 3   4.7 3.2 1.3 0.2 1
    # 4   4.6 3.1 1.5 0.2 1
    # 5   5.0 3.6 1.4 0.2 1
    # etc
    

    NOTE: the result of a subset operation is a regular, in-memory matrix. You can convert a regular matrix to a big.matrix with as.big.matrix()
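
    For example, continuing from the subset above (is.matrix is just a quick check):

    sub <- x[mwhich(x, 1:2, list(5, 4), list('le', 'le'), 'AND'), ]
    is.matrix(sub)        # TRUE: an ordinary in-memory matrix
    sub_big <- as.big.matrix(sub)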

    Min, max, mean, etc

    biganalytics provides more functionality for big.matrices

    require(biganalytics)
    
    colmin(x, cols = 1:2, na.rm = FALSE)
    #   A   B 
    # 4.3 2.0
    
    colmax(x, cols = 1:2, na.rm = FALSE)
    #   A   B 
    # 7.9 4.4 
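
    To cover the "mean" in the heading: biganalytics also has colmean() (plus colrange(), colsd(), and friends), which, as far as I remember, follow the same pattern:

    colmean(x, cols = 1:2, na.rm = FALSE)
    #        A        B 
    # 5.843333 3.057333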
    

    Output

    Finally, you can write a big.matrix back out to a file with

    write.big.matrix(...)
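
    For instance (the output file name is just a placeholder):

    write.big.matrix(x, "iris_out.csv", sep = ",",
                     row.names = FALSE, col.names = TRUE)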