
collapse package: sum over two vectors but keep empty intersections


I would like to aggregate a vector/matrix y by two variables a and b via the fsum function of the collapse package. fsum does not return values for empty intersections. Is there a way to keep empty intersections using the collapse package? I know that I could, e.g., work through cross-joins in data.table, but as my function input is a vector and speed really matters, I would like to avoid converting the input matrix to a data.table and then converting the output back to a matrix/vector (for a data.table solution, see e.g. here: data.table calculate sums by two variables and add observations for "empty" groups).

Here is an example:

library(collapse)

set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)

fsum(x = y, g = data.frame(a = a, b = b))
#           [,1]
#1.1 -0.40955189
#1.2 -0.05710677
#2.2  0.50360797
#2.3 -1.28459935
#3.1  0.04672617
#3.2 -0.69095384
#3.3 -0.23570656
#4.1  0.80418951
#5.2  1.08576936

What I would like to get: the regular output above, but keeping the empty intersections of (a, b), e.g. (a = 1, b = 3), and assigning them a missing value or zero:

#   a b          y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3         NA
#4: 2 1         NA
#5: 2 2 -0.4115108
#6: 2 3  0.4356833
#.................

As an addition: base::aggregate() has a function argument drop = FALSE that achieves this:

aggregate(y, data.frame(a, b), sum, drop = FALSE)
#   a b         V1
#1  1 1 -0.7702614
#2  2 1         NA
#3  3 1 -1.2375384
#4  4 1 -0.2894616
#5  5 1         NA
#6  1 2 -0.2992151
#7  2 2 -0.4115108
#8  3 2 -0.8919211
#9  4 2         NA
#10 5 2  0.2522234
#11 1 3         NA
#12 2 3  0.4356833
#13 3 3 -0.2242679
#14 4 3         NA
#15 5 3         NA

Nevertheless, in my experience both data.table and collapse are significantly faster, but collapse has the advantage that it also works with matrix objects (which do not need to be converted to data.tables).

Is there a way to achieve this via collapse?


Solution

  • Yes, you can do that with fsum; however, other functions like fmedian will warn about it. To do so, you need to create factors and interact them using : like so:

    library(collapse)
    
    set.seed(1)
    a <- sample(1:5, 10, replace = TRUE)
    b <- sample(1:3, 10, replace = TRUE)
    y <- matrix(rnorm(10), 10, 1)
    
    fsum(x = y, g = qF(a):qF(b))
    #           [,1]
    # 1:1 -0.7702614
    # 1:2 -0.2992151
    # 1:3         NA
    # 2:1         NA
    # 2:2 -0.4115108
    # 2:3  0.4356833
    # 3:1 -1.2375384
    # 3:2 -0.8919211
    # 3:3 -0.2242679
    # 4:1 -0.2894616
    # 4:2         NA
    # 4:3         NA
    # 5:1         NA
    # 5:2  0.2522234
    # 5:3         NA
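
    Why this works can be seen with base R alone: the : operator on two factors builds a factor whose levels are every combination of the input levels, including combinations that never occur in the data. The sketch below uses made-up toy values (fa, fb and v are hypothetical names, not from the question) and tapply instead of fsum, purely to illustrate the mechanism:

    ```r
    # Toy data (assumption for illustration): the pair (2, "y") never occurs
    fa <- factor(c(1, 1, 2))
    fb <- factor(c("x", "y", "x"))
    v  <- c(10, 20, 30)

    # Interacting the factors keeps all 2 x 2 level combinations
    g <- fa:fb
    levels(g)
    # "1:x" "1:y" "2:x" "2:y"

    # The unobserved combination is still a level, with a count of zero
    counts <- table(g)

    # Grouped sums by that factor keep the empty level; tapply fills it with NA
    sums <- tapply(v, g, sum)
    ```

    Grouping by such a factor therefore yields one result row per level, which is what produces the NA rows in the fsum output above.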
    

    Regarding the earlier example you gave, I'd also like to note that the expensive call to data.frame is not necessary: fsum(x = y, g = list(a = a, b = b)) is sufficient.
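
If the result is needed in the (a, b, y) layout shown in the question, or with zeros instead of NA, one possible follow-up is sketched below. It assumes the row names of the fsum result are the "a:b" levels of the interaction factor, as in the output above, and that the keys are integers that do not themselves contain a colon. The names g, s, keys, out and y0 are introduced here for illustration only.

```r
library(collapse)

set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)

g <- qF(a):qF(b)          # factor with all level combinations of a and b
s <- fsum(x = y, g = g)   # one row per level, NA for empty intersections

# Split the "a:b" level labels back into two key columns
keys <- do.call(rbind, strsplit(levels(g), ":", fixed = TRUE))

out <- data.frame(
  a = as.integer(keys[, 1]),
  b = as.integer(keys[, 2]),
  y = s[, 1],
  row.names = NULL
)

# Optional: report zeros instead of NA for the empty intersections
out$y0 <- out$y
out$y0[is.na(out$y0)] <- 0
```

Note that the zero replacement cannot distinguish an empty intersection from a group whose sum is NA for another reason (for example, a group containing only missing values), so it is only safe when y has no NAs.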