Tags: r, bigdata, ram, data-analysis

Rule of thumb for memory size of datasets in R


Are there any rules of thumb for knowing when R will have trouble handling a given dataset in RAM (given a particular PC configuration)?

For example, I have heard that one rule of thumb is to allow 8 bytes for each cell. So if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB; hence, on most home computers, we would probably have to store the data on disk and access it in chunks.
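
As a rough check of that arithmetic (assuming every column is numeric, i.e. stored as an 8-byte double), one can measure a smaller all-numeric matrix and scale up:

    m <- matrix(0, nrow = 1e6, ncol = 10)   # one hundredth of the 1e9 cells above
    print(object.size(m), units = "auto")
    ## ~76 Mb, i.e. about 8 bytes per double cell plus a small fixed overhead
    ## Scaling up: 1e6 rows * 1e3 columns * 8 bytes = 8e9 bytes, roughly 7.5 GiB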

Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations such as data tidying, data visualisation and some analysis (regression).

PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.


Solution

  • The memory footprint of some vectors at different sizes, in bytes.

    # Vector lengths to test
    n <- c(1, 1e3, 1e6)
    names(n) <- n
    # A single 100-character string, reused for the "identical strings" cases
    one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")
    
    # For each length, measure the memory footprint of several vector types
    sapply(
      n,
      function(n)
      {
        # n distinct random strings, each 100 characters long
        strings_of_one_hundred_chars <- replicate(
          n,
          paste(sample(letters, 100, replace = TRUE), collapse = "")
        )
        sapply(
          list(
            Integers                                 = integer(n),
            Floats                                   = numeric(n),
            Logicals                                 = logical(n),
            "Empty strings"                          = character(n),
            "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
            "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
            "Factor of empty strings"                = factor(character(n)),
            "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
            "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
            Raw                                      = raw(n),
            "Empty list"                             = vector("list", n)
          ),
          object.size
        )
      }
    )
    

    Some values differ between 64-bit and 32-bit R.

    ## Under 64-bit R
    ##                                          1   1000     1e+06
    ## Integers                                48   4040   4000040
    ## Floats                                  48   8040   8000040
    ## Logicals                                48   4040   4000040
    ## Empty strings                           96   8088   8000088
    ## Identical strings, nchar=100           216   8208   8000208
    ## Distinct strings, nchar=100            216 176040 176000040
    ## Factor of empty strings                464   4456   4000456
    ## Factor of identical strings, nchar=100 584   4576   4000576
    ## Factor of distinct strings, nchar=100  584 180400 180000400
    ## Raw                                     48   1040   1000040
    ## Empty list                              48   8040   8000040
    
    ## Under 32-bit R
    ##                                          1   1000     1e+06
    ## Integers                                32   4024   4000024
    ## Floats                                  32   8024   8000024
    ## Logicals                                32   4024   4000024
    ## Empty strings                           64   4056   4000056
    ## Identical strings, nchar=100           184   4176   4000176
    ## Distinct strings, nchar=100            184 156024 156000024
    ## Factor of empty strings                272   4264   4000264
    ## Factor of identical strings, nchar=100 392   4384   4000384
    ## Factor of distinct strings, nchar=100  392 160224 160000224
    ## Raw                                     32   1024   1000024
    ## Empty list                              32   4024   4000024
    

    Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
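
    To see how these per-element costs combine in practice, one option (a quick sketch; the column names x and g are arbitrary) is to build a smaller but realistically-typed data frame and measure it directly before committing to the full-sized data:

    df <- data.frame(
      x = runif(1e6),                                    # doubles: ~8 bytes per value
      g = factor(sample(letters, 1e6, replace = TRUE))   # factor codes: ~4 bytes per value
    )
    print(object.size(df), units = "auto")
    ## roughly 11-12 Mb on 64-bit R: ~8 Mb for x, ~4 Mb for g, plus a little overhead

    Note that this only bounds the size of the loaded object; operations such as tidying, plotting or fitting a regression typically create temporary copies, so peak memory usage will be somewhat higher.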