I want to filter a large data.table by group. I can use .SD or .I, and while I personally think the former is much easier to read, the latter is tremendously faster / uses much less memory (despite using .SDcols).

To some extent it is clear to me why: for .I we just need a vector per group, while for .SD we need a whole data.table. But I thought that by providing a meaningful .SDcols argument I could speed things up / save some memory.

However, the benchmarks show that the .SD approach is about 60 times slower and eats up 300 times more memory. Granted, a 4-column .SD data.table will need more than 4 times the size of a vector, but 60 times slower and 300 times more memory? Could somebody enlighten me why the .SD approach eats up so much memory and is thus so much slower? Is there a way to speed up the .SD approach, or is the only option to fall back to the .I approach?
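To make the comparison concrete, here is a minimal sketch of the two idioms on a toy table (made-up toy data just for illustration, not the benchmark data below):

## .I: build an integer vector of row numbers per group, then subset once
toy <- data.table(ids = c("a", "a", "b"), col_1 = c(3L, 1L, 2L), col_2 = 4:6)
toy[toy[, .(row_id = .I[which.min(col_1)]), by = ids]$row_id]
## .SD: materialise a sub-data.table per group and subset that
toy[, .SD[which.min(col_1)], by = ids, .SDcols = c("col_1", "col_2")]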
Data Setup
library(data.table)
## data set up: 1e6 rows, ids drawn with replacement from ~800k candidates,
## plus 100 integer columns col_1 ... col_100
nr <- 1e6
nc <- 100
grp_perc <- .8
DT <- data.table(ids = sample(paste0("id", seq(1, round(grp_perc * nr, 0))),
                              nr, TRUE))
cols <- paste("col", seq(1, nc), sep = "_")
DT[, (cols) := replicate(nc, sample(nr), simplify = FALSE)]
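Because the ids are sampled with replacement, the table ends up with roughly 570k distinct groups. A quick, optional sanity check using data.table's uniqueN:

dim(DT)          # 1e6 rows, 101 columns
uniqueN(DT$ids)  # roughly 570k distinct groups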
Benchmarks
results <- bench::mark(
  .I  = DT[DT[, .(row_id = .I[which.min(col_1)]), by = ids]$row_id,
           c("ids", cols[1:3]), with = FALSE],
  .SD = DT[, .SD[which.min(col_1)], by = ids, .SDcols = cols[1:3]],
  iterations = 1, filter_gc = FALSE)
summary(results)
# A tibble: 2 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result         memory         time   gc
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>         <list>         <list> <list>
1 .I           2.64s   2.64s   0.378      34.4MB    0         1     0      2.64s <df[,4] [571,~ <df[,3] [1,41~ <bch:~ <tibble ~
2 .SD          2.73m   2.73m   0.00612     9.1GB    0.342     1    56      2.73m <df[,4] [571,~ <df[,3] [2,40~ <bch:~ <tibble ~
Here's an approach that is faster than the .I approach for this particular example. Note that it also changes the row order, which may not be desirable for you.
DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]]
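If the original order matters, one possible workaround (a sketch; row_id is a hypothetical helper column added only for this purpose) is to carry a row index through the grouping and re-sort on it afterwards. Note that this gives you the selected rows in the order they appear in DT, which is not necessarily the order the .I version returns them in.

DT[, row_id := .I]   # temporary helper column holding the original row numbers
res <- DT[order(col_1), .SD[1L], by = ids, .SDcols = c("row_id", cols[1:3])]
setorder(res, row_id)   # re-sort the selected rows to their order in DT
res[, row_id := NULL]
DT[, row_id := NULL]    # clean up the helper column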
As @Ian Campbell mentions, this is a GitHub issue. The good news is that there are some optimizations, one of which is .SD[1L]: the subsetting is done entirely in C, which makes it very fast.
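If you want to see what data.table does with the grouped j expression, you can pass verbose = TRUE to the call (a sketch; the exact messages printed depend on your data.table version):

## verbose = TRUE reports how the grouped j expression is evaluated/optimized
DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3], verbose = TRUE]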
Here are the benchmarks, which include @sindri_baldur's solution but leave out your original .SD attempt - I didn't want to wait 3 minutes :).
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 .I 4.54s 4.54s 0.220 30MB 0.880 1 4 4.54s
2 self_join 11.32s 11.32s 0.0883 76.3MB 0 1 0 11.32s
3 use_order 3.55s 3.55s 0.282 58.3MB 0 1 0 3.55s
## show that it's equal but re-ordered:
all.equal(
  DT[DT[, .(row_id = .I[which.min(col_1)]), by = ids]$row_id,
     c("ids", cols[1:3]), with = FALSE][order(col_1)],
  DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]])
## [1] TRUE