I would like to aggregate a vector/ matrix y by two variables a and b via the fsum function of the collapse package. fsum does not return values for empty intersections. Is there a way to keep empty intersection using the collapse package? I know that I could e.g. work through cross-joins and data.table, but as my function input is a vector and speed really matters, I would like to avoid converting the input matrix to a data.table and then convert the output back to a matrix / vector (for a solution with data.table, see e.g. here: data.table calculate sums by two variables and add observations for "empty" groups).
Here is an example:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = data.frame(a = a, b = b))
#> fsum(x = y, g = data.frame(a = a, b = b))
# [,1]
#1.1 -0.40955189
#1.2 -0.05710677
#2.2 0.50360797
#2.3 -1.28459935
#3.1 0.04672617
#3.2 -0.69095384
#3.3 -0.23570656
#4.1 0.80418951
#5.2 1.08576936
What I would like to get: the regular output above, but keeping the empty intersections of (a, b) - e.g (a = 1, b = 3) and assign a missing or zero:
# a b y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3 NA
#4: 2 1 NA
#5: 2 2 -0.4115108
#6: 2 3 0.4356833
#.................
As an addition: base::aggregate()
has a function argument drop = FALSE
that achieves this:
aggregate(y, data.frame(a, b), sum, drop = FALSE)
a b V1
#1 1 1 -0.7702614
#2 2 1 NA
#3 3 1 -1.2375384
#4 4 1 -0.2894616
#5 5 1 NA
#6 1 2 -0.2992151
#7 2 2 -0.4115108
#8 3 2 -0.8919211
#9 4 2 NA
#10 5 2 0.2522234
#11 1 3 NA
#12 2 3 0.4356833
#13 3 3 -0.2242679
#14 4 3 NA
#15 5 3 NA
Nevertheless, in my experience both data.table
and collapse
are significantly faster, butcollapse
has the advantage that it also works with matrix objects (that do not need to be converted to data.table's).
Is there away to achieve this via collapse?
yes you can do that with fsum
, however other functions like fmedian
will warn about that. To do that you need to create factors and interact them using :
like so:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = qF(a):qF(b))
# [,1]
# 1:1 -0.7702614
# 1:2 -0.2992151
# 1:3 NA
# 2:1 NA
# 2:2 -0.4115108
# 2:3 0.4356833
# 3:1 -1.2375384
# 3:2 -0.8919211
# 3:3 -0.2242679
# 4:1 -0.2894616
# 4:2 NA
# 4:3 NA
# 5:1 NA
# 5:2 0.2522234
# 5:3 NA
For the earlier example you gave, I'd also like to note that the expensive call to data.frame
is absolutely not necessary, fsum(x = y, g = list(a = a, b = b))
is sufficient.