Search code examples
rdata.tablegrouping

Return multiple rows per group in data.table


Is it possible to return multiple rows per group in a grouped command in data.table? In dplyr, this is done with reframe:

y <- c("a", "b", "d", "f")
df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "e", "f", "c", "a")
)

library(dplyr)
df %>%
  reframe(x = setdiff(x, y), .by = g)
# g x    
# 1 e    
# 2 e    
# 2 c   

In data.table, this returns an error:

library(data.table)
dt <- setDT(df)
dt[, x := setdiff(x, y), g]

Error in [.data.table(df, , :=(x, intersect(x, y)), g) :
Supplied 2 items to be assigned to group 1 of size 3 in column 'x'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

Anyway to get a data.table equivalent of reframe?


Solution

  • Wrap in .(...) and use = in place of := (because it's within .(..)).

    as.data.table(df)[, .(x = setdiff(x, y)), by = g]
    #        g      x
    #    <num> <char>
    # 1:     1      e
    # 2:     2      e
    # 3:     2      c
    

    Note that under the hood, .(.) is really just list(.), so we could also use anything that returns list-like objects, including:

    as.data.table(df)[, list(x = setdiff(x, y)), by = g]
    as.data.table(df)[, data.table(x = setdiff(x, y)), by = g]
    as.data.table(df)[, data.frame(x = setdiff(x, y)), by = g]