Search code examples
rdata.tabledcast

Is it possible to operate on the aggregated dcast values?


dcast.data.table has the feature to aggregate on multiple value.var. Is it possible to somehow reference the aggregated values being created and then perform an operation within the fun ?

This creates the 4 variables

DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
                z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L)

> head(DT)
   x y z        d1 d2
1: 3 2 a 0.6166590  1
2: 3 1 a 0.1891611  1
3: 5 2 a 0.3061658  1
4: 3 1 a 0.7233832  1
5: 2 1 b 0.6799675  1
6: 2 1 b 0.5144392  1

dcast(DT, x + y ~ z, fun=sum, value.var=c("d1", "d2"))

   x y      d1_a      d1_b d2_a d2_b
1: 1 1 1.0400277 0.3835004    2    1
2: 2 1 0.7032111 1.3713884    1    2
3: 3 1 0.9759893 2.0853103    1    3
4: 3 2 0.5210792 0.0000000    1    0
5: 4 1 1.0971931 0.4417819    2    1
6: 4 2 0.5009533 0.0000000    1    0
7: 5 1 0.9372943 0.0000000    4    0
8: 5 2 0.7671728 0.0000000    1    0

And the operation(s) on the aggregated values can be performed in the second step

dcast(DT, x + y ~ z, fun=sum, value.var=c("d1", "d2"))[,.(div1 = d1_a/d2_a
                                                          ,div2 = d1_b/d2_b)]

        div1      div2
1: 0.5200139 0.3835004
2: 0.7032111 0.6856942
3: 0.9759893 0.6951034
4: 0.5210792       NaN
5: 0.5485965 0.4417819
6: 0.5009533       NaN
7: 0.2343236       NaN
8: 0.7671728       NaN


Solution

  • This is a bit convoluted, but for this problem I think you could do the following:

    zs <- unique(DT$z)
    
    sum_div <- function(dt) {
      ans <- dt[, .(div = sum(d1) / sum(d2)), by = .(z)]
      split(ans$div, factor(ans$z, levels = zs), drop = FALSE)
    }
    
    DT[, sum_div(.SD), by = .(x, y), .SDcols = c("z", "d1", "d2")]
    

    What happens is that .SD ends up having the 3 columns specified in .SDcols, but with different subsets for the possible combinations of x and y values. Then, sum_div performs your desired operation only on that subset, and splits the result to return a list so that each possible value of z gets its own column in the final data.table.

    It's important to do factor(ans$z, levels = zs) in order to get the same number of list elements every time (data.table expects that); by specifying how many levels we expect, split will return an empty vector if there are no values for a level, but it will definitely return something for each.

    Note that you could achieve the same with:

    dcast(DT[, .(div = sum(d1) / sum(d2)), by = .(x, y, z)], x + y ~ z, value.var = "div")
    

    I'm not sure if you gain considerable performance by doing everything in one step.

    EDIT: you probably don't:

    library(data.table)
    library(microbenchmark)
    
    n <- 2e5
    DT = data.table(x = sample(5L, n, TRUE),
                    y = sample(3L, n, TRUE),
                    z = sample(letters[1:2], n, TRUE),
                    d1 = runif(n),
                    d2 = 1L)
    
    zs <- sort(unique(DT$z))
    
    sum_div <- function(dt) {
      ans <- dt[, .(div = sum(d1) / sum(d2)), by = .(z)]
      split(ans$div, factor(ans$z, levels = zs), drop = FALSE)
    }
    
    microbenchmark(
      one = DT[, sum_div(.SD), keyby = .(x, y), .SDcols = c("z", "d1", "d2")],
      two = dcast(DT[, .(div = sum(d1) / sum(d2)), by = .(x, y, z)], x + y ~ z, value.var = "div"),
      times = 10L
    )
    Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
      one 24.37323 25.74273 26.72413 25.99279 26.62943 34.40309    10
      two 11.31050 11.91650 12.66345 12.51094 13.01364 15.35549    10