Search code examples
rdata.tablegroupingaggregatesummary

data.table: Group by, then aggregate with custom function returning several new columns


In data.table, how can I do the following:

  • Group table by several columns combined
  • Then hand over each group to a custom aggregation function, which:
  • Takes all columns from the group table subset and aggregates them by returning several new columns which will be added to the table

The trick here is to generate several new columns without calling the aggregation function more than once.

Example:

library(data.table)
mtcars_dt <- data.table(mtcars)

returnsOneColumn <- function(dt_group_all_columns){
  "returned_value_1"
}

# works great, returns one new column as summary per group
mtcars_dt[,
          list( new_column_1 = returnsOneColumn(dt_group_all_columns= .SD) ),
          by = c("mpg", "cyl"),
          .SDcols = colnames(mtcars_dt)
          ]

returnsMultipleColumns <- function (dt_group_all_columns){
  list( "new_column_1" = "returned_value_1", 
        "new_column_2" = "returned_value_2"  )
}

# does not work: Ideally, I would like to have mpg, cyl, and several columns 
# generated from once calling returnsMultipleColumns
mtcars_dt[,
          list( returnsMultipleColumns(dt_group_all_columns = .SD) ),
          by = c("mpg", "cyl"),
          .SDcols = colnames(mtcars_dt)
          ]

# desired output should look like this
#
#     mpg cyl     new_column_1     new_column_2
# 1: 21.0   6 returned_value_1 returned_value_2
# 2: 22.8   4 returned_value_1 returned_value_2
# 3: 21.4   6 returned_value_1 returned_value_2
# 4: 18.7   8 returned_value_1 returned_value_2

Related:

Assign multiple columns using := in data.table, by group


Solution

  • You are already returning a list from the function. You do not need to list them again. So remove the list and have the code like below

    mtcars_dt[,
               returnsMultipleColumns(dt_group_all_columns = .SD),
               by = c("mpg", "cyl"),
               .SDcols = colnames(mtcars_dt)
               ]
         mpg cyl     new_column_1     new_column_2
     1: 21.0   6 returned_value_1 returned_value_2
     2: 22.8   4 returned_value_1 returned_value_2
     3: 21.4   6 returned_value_1 returned_value_2
     4: 18.7   8 returned_value_1 returned_value_2