
How to see the actual memory size of a big.matrix object of bigmemory package?


I am using the bigmemory package to load a heavy dataset, but when I check the size of the object (with the function object.size), it always returns 664 bytes. As far as I understand, the weight should be almost the same as that of a classic R matrix, depending on the type (double or integer). So why do I get 664 bytes as the answer? Reproducible code below. The first chunk is really slow, so feel free to reduce the number of simulated values; 10^6 * 20 will be enough.

# CREATE BIG DATABASE -----------------------------------------------------  
data <- as.data.frame(matrix(rnorm(6 * 10^6 * 20), ncol = 20))
write.table(data, file = "big-data.csv", sep = ",", row.names = FALSE)
format(object.size(data), units = "auto")
rm(list = ls())

# BIGMEMORY READ ----------------------------------------------------------  
library(bigmemory)
ini <- Sys.time()
data <- read.big.matrix(file = "big-data.csv", header = TRUE, type = "double")
print(Sys.time() - ini)
print(object.size(data), units = "auto")

Solution

  • To determine the size of the bigmemory matrix use:

    > GetMatrixSize(data)
    [1] 9.6e+08
    

    Explanation

    Data stored in big.matrix objects can be of type double (8 bytes, the default), integer (4 bytes), short (2 bytes), or char (1 byte).
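    The 9.6e+08 figure is simply rows × columns × bytes per element. For the simulated data set above (6 × 10^6 rows, 20 columns, type double at 8 bytes each), a quick sanity check in R:

    ```r
    # Size arithmetic for the big.matrix from the question:
    # 6e6 rows x 20 columns of doubles (8 bytes each).
    rows <- 6 * 10^6
    cols <- 20
    bytes_per_double <- 8
    rows * cols * bytes_per_double  # 9.6e+08 bytes, matching GetMatrixSize(data)
    ```

    The same arithmetic with 4, 2, or 1 in place of 8 gives the footprint for integer, short, or char matrices of the same dimensions.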

    The reason for the size disparity is that data stores only a pointer to a memory-mapped file. You should be able to find the new file in the temporary directory of your machine. [Paragraph quoted from R High Performance Programming]

    Essentially, big.matrix maintains a binary data file on disk, called a backing file, that holds all of the values in a data set. When values from a big.matrix object are needed by R, a check is performed to see whether they are already in RAM (cached). If they are, the cached values are returned; if not, they are retrieved from the backing file. These caching operations reduce the time needed to access and manipulate the data across separate calls, and they are transparent to the statistician.

    See page 8 of the package documentation for a description:

    https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf

    References:

    • Aloysius Lim and William Tjhi, R High Performance Programming
    • Duncan Temple Lang and Deborah Nolan, Data Science in R