Let's say we have a list of data.table
's like this:
dt <- data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T))
myList <- list(dt1=dt,dt2=dt,dt3=dt,dt4=dt,dt5=dt)
If I wanted a solution that would calculate the mean per group across all data.tables, I could do the following:
bigDT <- rbindlist(myList)
bigDT[,list('average'=mean(x)),by=letters]
With my data, however, each dt
is quite large (millions of rows) and each list is substantial as well (500-1000 dt
in each list). There are also considerably more than two options for the by
choice.
Part of a function I plan on optimizing using genetic algorithms requires computing the above mean by group. I was wondering if there was a more efficient solution than rbind
-ing the list before using data.table
's ability to calculate per group? Otherwise, the maximization algorithm will makes many function calls to this potentially bottlenecking calculation.
Any help would be appreciated!
microbenchmark(doThis())
Unit: milliseconds
expr min lq mean median uq max neval
doThis() 151.512 154.3395 174.8071 167.7151 170.2952 440.9359 100
One method is to computed the grouped means for each table within the list, then bind, then compute a weighted mean of them. Since you have different counts of each letter, you'll need to preserve the .N
as well.
I'm going to change each element of the list so that we can verify the weighted-mean calculations. For reproducibility:
set.seed(1)
myList <- replicate(5, data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T)),
simplify=FALSE)
myList[1:2]
# [[1]]
# x letters
# 1: 93.73546 P
# 2: 101.83643 I
# 3: 91.64371 F
# 4: 115.95281 V
# 5: 103.29508 D
# ---
# 999996: 109.24487 Q
# 999997: 99.86486 K
# 999998: 93.95941 J
# 999999: 116.28763 O
# 1000000: 106.93750 E
# [[2]]
# x letters
# 1: 97.53576 R
# 2: 105.27503 T
# 3: 107.53592 L
# 4: 102.21228 M
# 5: 98.71087 G
# ---
# 999996: 109.46843 C
# 999997: 99.14458 M
# 999998: 96.76845 Y
# 999999: 94.22413 E
# 1000000: 98.25855 K
To do this for just one table:
head(myList[[1]][,.(mu = mean(x), n = .N), keyby=letters])
# letters mu n
# 1: A 100.04987 39005
# 2: B 100.01288 38576
# 3: C 99.97402 38547
# 4: D 99.99909 38460
# 5: E 100.03689 38030
# 6: F 100.02697 38293
First, compute the averages per-list-element:
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
Now do the weighted-mean either manually or with Hmisc::wtd.mean
:
cbind(
# just to verify the below answer is the same as the brute-force method of rbind-then-average
rbindlist(myList)[,.(mu = mean(x)), keyby=letters],
# either of these is your answer
myAgg[,.(mu = sum(n*mu)/sum(n)),keyby=letters],
myAgg[,.(mu = Hmisc::wtd.mean(mu, weights=n)),keyby=letters]
)
# letters mu letters mu letters mu
# 1: A 100.02325 A 100.02325 A 100.02325
# 2: B 100.03473 B 100.03473 B 100.03473
# 3: C 100.00688 C 100.00688 C 100.00688
# 4: D 100.04041 D 100.04041 D 100.04041
# 5: E 100.00780 E 100.00780 E 100.00780
# 6: F 100.01202 F 100.01202 F 100.01202
# 7: G 100.01200 G 100.01200 G 100.01200
# 8: H 99.97232 H 99.97232 H 99.97232
# 9: I 100.00495 I 100.00495 I 100.00495
# 10: J 100.03019 J 100.03019 J 100.03019
# 11: K 99.96851 K 99.96851 K 99.96851
# 12: L 100.01850 L 100.01850 L 100.01850
# 13: M 100.00976 M 100.00976 M 100.00976
# 14: N 100.01299 N 100.01299 N 100.01299
# 15: O 100.02108 O 100.02108 O 100.02108
# 16: P 100.02052 P 100.02052 P 100.02052
# 17: Q 100.03814 Q 100.03814 Q 100.03814
# 18: R 99.99013 R 99.99013 R 99.99013
# 19: S 99.95219 S 99.95219 S 99.95219
# 20: T 99.97721 T 99.97721 T 99.97721
# 21: U 99.96310 U 99.96310 U 99.96310
# 22: V 99.94430 V 99.94430 V 99.94430
# 23: W 99.98877 W 99.98877 W 99.98877
# 24: X 100.07352 X 100.07352 X 100.07352
# 25: Y 99.96677 Y 99.96677 Y 99.96677
# 26: Z 99.99397 Z 99.99397 Z 99.99397
# letters mu letters mu letters mu
Quick benchmarking, for comparison:
library(microbenchmark)
microbenchmark(
bruteforce = rbindlist(myList)[,.(mu = mean(x)), keyby=letters],
# either of these is your answer
baseR = {
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
myAgg[,.(mu = sum(n*mu)/sum(n)),keyby=letters]
},
Hmisc = {
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
myAgg[,.(mu = Hmisc::wtd.mean(mu, weights=n)),keyby=letters]
},
times=50
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# bruteforce 131.8770 139.4562 153.93202 151.95375 159.6329 315.6117 50
# baseR 89.7047 93.3623 109.20174 98.11670 115.0171 268.2517 50
# Hmisc 89.2784 91.5927 97.87455 93.73475 98.1655 119.2671 50