I have a 4 GB CSV file to load on my 16 GB machine; fread and read.csv can't load it at once, they return memory errors. So I decided to read the file in chunks, which worked (after an hour or so), and I now have a list of data.frames that takes 2.5 GB if I trust the Environment tab in RStudio, and 1.2 GB when saved as an RDS.
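For context, the chunked read looks roughly like this (the file name and chunk size are placeholders):

library(data.table)
chunk_size <- 1e6                                  # rows per chunk, tuned to fit in RAM
header <- names(fread("data.csv", nrows = 0))      # grab the column names only
chunks <- list()
skip <- 1                                          # skip the header line
repeat {
  chunk <- tryCatch(
    fread("data.csv", skip = skip, nrows = chunk_size,
          header = FALSE, col.names = header),
    error = function(e) NULL                       # fread errors once we run past the end
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
  skip <- skip + chunk_size
}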
The issue I have now is concatenating everything back into one big data.frame. From what I understand, rbindlist is the most efficient solution (or is it bind_rows?), but in my case it still uses too much memory.
I think I can solve this by applying rbindlist to the list items n by n, then recursively, until I get my final data.frame. This n would have to be calibrated manually though, and the whole process is really ugly (on top of the annoying CSV import).
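Just to show what I mean, a rough sketch of that recursive merge (with pairs instead of a tuned n):

library(data.table)
# merge the list elements pairwise until a single data.table remains,
# so no single rbindlist call has to deal with the whole list at once
merge_up <- function(lst) {
  while (length(lst) > 1) {
    groups <- split(lst, ceiling(seq_along(lst) / 2))
    lst <- lapply(groups, rbindlist)
    gc()  # release the intermediate copies before the next pass
  }
  lst[[1]]
}
# big_dt <- merge_up(my_chunks)   # my_chunks is the list of data.frames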
Another idea that crossed my mind is to find a way to feed an SQLite database with my loaded data and then query it from R (I'll only do subset, min and max operations on the data).
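Something along these lines with DBI and RSQLite is what I had in mind (the table and column names are just examples):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "mydata.sqlite")
# append each chunk so the full table never has to live in R's memory
for (chunk in my_chunks) {                 # my_chunks is the list of data.frames
  dbWriteTable(con, "mydata", chunk, append = TRUE)
}
# subset / min / max can then be pushed down to SQLite instead of R
dbGetQuery(con, "SELECT MIN(col1), MAX(col1) FROM mydata WHERE col2 > 0")
dbDisconnect(con)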
Can I do better than this? My data consists only of integer and double values, if that makes a difference.
Sounds like bigmemory might have just enough functionality for your problem.
require(bigmemory)
You can read files in as a big.matrix with:
read.big.matrix(filename, sep = ",", header = FALSE, col.names = NULL,
row.names = NULL, has.row.names = FALSE, ignore.row.names = FALSE,
type = NA, skip = 0, separated = FALSE, backingfile = NULL,
backingpath = NULL, descriptorfile = NULL, binarydescriptor = FALSE,
extraCols = NULL, shared = TRUE)
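For a file like yours (all numeric), something along these lines should do it; the file and backing-file names below are just placeholders:

# file-backed big.matrix: the data stays on disk instead of filling RAM
x <- read.big.matrix("mydata.csv", sep = ",", header = TRUE,
                     type = "double",
                     backingfile = "mydata.bin",
                     descriptorfile = "mydata.desc")
# later sessions can re-attach via the descriptor without re-reading the csv
x <- attach.big.matrix("mydata.desc")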
Even with a simple example like iris, you can see memory savings:
x <- as.big.matrix(iris)
options(bigmemory.allow.dimnames=TRUE)
colnames(x) <- c("A", "B", "C", "D", "E")
object.size(x)
# 664 bytes
object.size(iris)
# 7088 bytes
Subsetting big.matrix objects is limited, but some functionality is provided with mwhich. For example, to subset where column 1 is <= 5 AND column 2 is <= 4:
x[mwhich(x, 1:2, list(c(5), c(4)), list(c('le'), c('le')), 'AND'),]
# A B C D E
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# etc
NOTE: the result of a subset operation is a regular matrix. You can convert a regular matrix back to a big.matrix with as.big.matrix().
biganalytics provides more functionality for big.matrix objects:
require(biganalytics)
colmin(x, cols = 1:2, na.rm = FALSE)
# A B
# 4.3 2.0
colmax(x, cols = 1:2, na.rm = FALSE)
# A B
# 7.9 4.4
Finally, you can write a big.matrix back out to a file with write.big.matrix(...)
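e.g. (the output file name is just an example):

write.big.matrix(x, "out.csv", sep = ",", col.names = TRUE)  # writes the big.matrix back to csv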