Search code examples
rdata.tabler-collapse

How to translate data.table code to collapse


I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real world examples.

Here's my data.table code:

library(data.table)
library(nycflights13)

data("flights")
flights_DT <- as.data.table(flights)

val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")

flights_DT[
  j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)), 
  keyby = c(id_var, by)
][
  i = order(-agg_val_var), 
  j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)), 
  keyby = by
][
  j = .SD[2L],
  keyby = by
][
  order(-value_share)
]
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---                      
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286

Created on 2021-03-11 by the reprex package (v1.0.0)

I managed to translate the first data.table call, but struggled later on.

It would be great to see how collapse would be used to handle this use case.


Solution

  • I think it only makes sense to translate data.table code to collapse if

    1. you've come up with an arcane expression in data.table to do something complex statistical it is is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting)

    2. you actually don't need the data.table object but would much rather work with vectors / matrices / data.frame's / tibbles

    3. you want to write a statistical program and would much prefer standard evaluation programming over NS eval and data.table syntax or

    4. collapse is indeed substantially faster for your specific application.

    Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you 3 solutions ranging from full NS eval to full standard eval base R style programming.

    library(data.table)
    library(nycflights13)
    library(magrittr)
    library(collapse)
    
    data("flights")
    flights_DT <- as.data.table(flights)
    
    # Defining a function for the second aggregation
    myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]
    
    # Soluting 1: Non-Standard evaluation
    flights_DT %>%
      fgroup_by(carrier, month, day) %>% 
      fsummarise(agg_val_var = fsum(abs(arr_delay))) %>% 
      roworder(month, day, -agg_val_var, na.last = NA) %>%
      fgroup_by(month, day) %>%
      fsummarise(value_share = myFUN(agg_val_var)) %>% 
      roworder(-value_share)
    #>      month day value_share
    #>   1:    10   3   0.5263012
    #>   2:     1  24   0.5045664
    #>   3:     1  20   0.4885145
    #>   4:    10  17   0.4870692
    #>   5:     3   6   0.4867606
    #>  ---                      
    #> 361:     5   4   0.3220295
    #> 362:     6  15   0.3205974
    #> 363:     1  28   0.3197260
    #> 364:    11  25   0.3161550
    #> 365:     6  14   0.3128286
    

    Created on 2021-03-12 by the reprex package (v0.3.0)

    Note the use of na.last = NA wich actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE). Now the hybrid example which is probably closes to the code you provided:

    val_var <- "arr_delay"
    id_var <- "carrier"
    by <- c("month", "day")
    
    # Solution 2: Hybrid approach with standard eval and magrittr pipes
    flights_DT %>%
      get_vars(c(id_var, val_var, by)) %>%
      ftransformv(val_var, abs) %>% 
      collapv(c(id_var, by), fsum) %>%
      get_vars(c(by, val_var)) %>%
      roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>%
      collapv(by, myFUN) %>%
      roworderv(val_var, decreasing = TRUE) %>%
      frename(replace, names(.) == val_var, "value_share")
    #>      month day value_share
    #>   1:    10   3   0.5263012
    #>   2:     1  24   0.5045664
    #>   3:     1  20   0.4885145
    #>   4:    10  17   0.4870692
    #>   5:     3   6   0.4867606
    #>  ---                      
    #> 361:     5   4   0.3220295
    #> 362:     6  15   0.3205974
    #> 363:     1  28   0.3197260
    #> 364:    11  25   0.3161550
    #> 365:     6  14   0.3128286
    

    Created on 2021-03-12 by the reprex package (v0.3.0)

    Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:

    # Solution 3: Programming
    data <- get_vars(flights_DT, c(id_var, val_var, by))
    data[[val_var]] <- abs(.subset2(data, val_var))
    g <- GRP(data, c(id_var, by))
    data <- add_vars(get_vars(g$groups, by), 
                     fsum(get_vars(data, val_var), g, use.g.names = FALSE))
    data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
    g <- GRP(data, by)
    columns
    data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
    data <- roworderv(data, "value_share", decreasing = TRUE)
    data
    #>      month day value_share
    #>   1:    10   3   0.5263012
    #>   2:     1  24   0.5045664
    #>   3:     1  20   0.4885145
    #>   4:    10  17   0.4870692
    #>   5:     3   6   0.4867606
    #>  ---                      
    #> 361:     5   4   0.3220295
    #> 362:     6  15   0.3205974
    #> 363:     1  28   0.3197260
    #> 364:    11  25   0.3161550
    #> 365:     6  14   0.3128286
    

    Created on 2021-03-12 by the reprex package (v0.3.0)

    I refer you to the blog post on programming with collapse for a more interesting example on how this can benefit the development of statistical codes.

    Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a windows machine. This checks equality:

    all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
    #> TRUE
    
    

    Now the benchmark:

    library(microbenchmark)
    microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
    #> Unit: milliseconds
    #>          expr      min       lq     mean   median       uq       max neval cld
    #>          DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073   100   b
    #>     clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140  28.16895   100  a 
    #>  clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604  32.94648   100  a 
    #>    clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935  39.24706   100  a