Guide on how to trace memory of objects in R


While checking speed with microbenchmark seems to be straightforward, I struggle with tracking memory usage.

So to have a concrete example, say I have the following:

set.seed(12)

library(data.table)

dt_size <- 1e7

dt <- 
  data.table(
    'a1' = sample(1:1000, size = dt_size, replace = TRUE),
    'a2' = sample(1:1000, size = dt_size, replace = TRUE),
    'a3' = sample(1:1000, size = dt_size, replace = TRUE),
    'b1' = sample(1:1000, size = dt_size, replace = TRUE),
    'b2' = sample(1:1000, size = dt_size, replace = TRUE),
    'b3' = sample(1:1000, size = dt_size, replace = TRUE)
  )

Say I want to split the data: all records with a1 == 1 should be kept as they are, and all others should be aggregated into a single row. So I could do:

dt_aggr <- 
  rbind(
    dt[a1 == '1'],
    dt[a1 != '1', lapply(.SD, sum)]
  )

Maybe there is a faster way, but that's not the point here.

I am wondering: what is happening here? Does R make one copy of dt for the expression a1 == '1' and another copy for the expression a1 != '1'?

If so, I would assume that dt is held in memory twice, at least for a short period, and then I would probably think: "Hm, this may not be a good idea; I should look for a solution that works entirely within the existing dt." If it does not make a copy, then it would be fine for me, at least if there is no faster solution.

Is there a way to track this for someone who does not deeply understand memory allocation and just wants to try it out? What is the best way to do so?


Solution

  • You may find bench::mark to be helpful, e.g., demonstrating the inefficiency of the type coercion in the example:

    bench::mark(
      dt[a1 == '1'],
      dt[a1 == 1],
      dt[a1 == 1L]
    )
    #> # A tibble: 3 × 6
    #>   expression             min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 "dt[a1 == \"1\"]"    1.31s    1.31s     0.764   154.5MB     0   
    #> 2 "dt[a1 == 1]"       1.21ms   1.36ms   660.       77.3MB     6.32
    #> 3 "dt[a1 == 1L]"      1.21ms    1.3ms   743.      394.2KB     6.24
    

    Coercion to a string is 3 orders of magnitude slower and uses ~400 times as much memory.
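
The same approach can be pointed at the two pieces of the rbind() call from the question, to see how much memory each subset allocates relative to the size of the full table. The following is only a minimal sketch: the reduced row count n, the two-column table, and the labels keep and aggr are assumptions made so that it runs quickly, not part of the original example. The mem_alloc column is only filled in when R was built with memory profiling support.

```r
library(data.table)

set.seed(12)
n <- 1e5  # assumption: far smaller than the question's 1e7, for a quick run

# Illustrative two-column stand-in for the question's six-column table
dt <- data.table(
  a1 = sample(1:1000, size = n, replace = TRUE),
  b1 = sample(1:1000, size = n, replace = TRUE)
)

# Size of the full table, as a reference point
full_size <- object.size(dt)

# Memory allocated while evaluating each piece of the rbind() call.
# check = FALSE because the two expressions return different results.
res <- bench::mark(
  keep = dt[a1 == 1L],
  aggr = dt[a1 != 1L, lapply(.SD, sum)],
  check = FALSE
)

print(format(full_size, units = "MB"))
print(res[, c("expression", "median", "mem_alloc")])
```

Comparing mem_alloc for each row against the size of dt gives a rough indication of whether an expression allocates something on the order of the whole table or only a small fraction of it.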