While checking speed with microbenchmark seems straightforward, I struggle with tracking memory usage.
To make this concrete, say I have the following:
set.seed(12)
library(data.table)

dt_size <- 1e7
dt <- data.table(
  a1 = sample(1:1000, size = dt_size, replace = TRUE),
  a2 = sample(1:1000, size = dt_size, replace = TRUE),
  a3 = sample(1:1000, size = dt_size, replace = TRUE),
  b1 = sample(1:1000, size = dt_size, replace = TRUE),
  b2 = sample(1:1000, size = dt_size, replace = TRUE),
  b3 = sample(1:1000, size = dt_size, replace = TRUE)
)
Say I want to split the data: all records with a1 == 1 should be kept as they are, and all others should be aggregated into a single row. So I could do:
dt_aggr <- rbind(
  dt[a1 == '1'],
  dt[a1 != '1', lapply(.SD, sum)]
)
Maybe there is a faster way, but that's not the point here.
I am wondering: what is happening here? Is R making a copy of dt for the expression a1 == '1' AND another copy for the expression a1 != '1'? If so, I would assume that dt is, at least for a short period, held twice in memory, and I would probably think "hm, maybe not a good idea, I should look for a solution entirely within the existing dt". If it does not make a copy, then it would be fine for me, at least if there is no faster solution.
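For scale, here is a rough check of how much memory a full duplicate of dt would need (the ~229 Mb figure is my own back-of-the-envelope estimate for six integer columns of 1e7 rows):

# 6 integer columns x 1e7 rows x 4 bytes each is about 229 Mb,
# so a temporary full copy would briefly double that footprint
print(object.size(dt), units = "Mb")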
Is there a way to track this for someone who does not deeply understand memory allocation and just wants to try it out? What is the best way to do so?
You may find bench::mark
to be helpful, e.g., demonstrating the inefficiency of the type coercion in the example:
# same subset three ways: character (forces coercion of a1), double, integer
bench::mark(
  dt[a1 == '1'],
  dt[a1 == 1],
  dt[a1 == 1L]
)
#> # A tibble: 3 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dt[a1 == \"1\"]"   1.31s    1.31s     0.764   154.5MB     0
#> 2 "dt[a1 == 1]"      1.21ms   1.36ms   660.       77.3MB     6.32
#> 3 "dt[a1 == 1L]"     1.21ms    1.3ms   743.      394.2KB     6.24
Coercion to a string is three orders of magnitude slower and uses roughly 400 times as much memory (154.5MB vs. 394.2KB).
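The same approach can be applied to the rbind() from the question to measure the whole split-and-aggregate step. This is just a sketch of how I would check it; note that mem_alloc reports the total memory R allocates while the expression runs, not the peak resident usage, so intermediate allocations that the garbage collector frees still count:

# measure the full expression from the question (using the integer
# comparison); mem_alloc shows roughly how much the two subsets and
# the aggregation allocate in total
bench::mark(
  rbind(
    dt[a1 == 1L],
    dt[a1 != 1L, lapply(.SD, sum)]
  )
)

If you specifically want to know whether dt itself gets duplicated, base R's tracemem(dt) is another option: it prints a message whenever R copies the traced object, so a silent run of the subset suggests no full copy of dt was made.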