Search code examples
rnesteddata.tablegrouping

What is the fastest way of creating an identificator for multi-row groups with data.table in R?


I have a dataframe that identifies a set of values with an id:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3
#> 5:  c      1      0
#> 6:  c      1      3

As you can see, the ids a and c identify both the same set of values. So I want to create a "pattern id", that identifies the set of values associated with the ids a and c (obs: an id might identify more than two rows, I just limited them to two rows here for the sake of simplicity).

I did manage to come up with a solution using nested data.tables and match():

dt <- dt[, .(data = list(.SD)), by = id]

unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]

dt
#>    id pattern_id
#> 1:  a          1
#> 2:  b          2
#> 3:  c          1

It does the trick, but it is not as fast as I'd like it to be. match() documentation is pretty clear regarding its efficiency with lists:

Matching for lists is potentially very slow and best avoided except in simple cases.

As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids to the pattern ids. I feel like nesting the data, using it to match and then dropping it afterwards is a bit wasteful, but not sure if there's a better way. I was thinking in something that transform each dataframe into a string, or, even better, something that avoided the nesting altogether, but I couldn't come up with anything better than what I have now.

I have created a bigger dataset to play around with and test different solutions:

set.seed(0)
size <- 1000000
dt <- data.table(
  id = rep(1:(size / 2), each = 2),
  value1 = sample(1:10, size, replace = TRUE),
  value2 = sample(1:10, size, replace = TRUE)
)

Solution

  • Updated (to remove join):

    This one replicates your approach (i.e. it requires that the order is the same as well as the values)

    unique(
      dt[, pattern:=.(paste0(c(value1,value2), collapse=",")), by=id][,.(id,pattern)]
    )[,grp:=.GRP, by=pattern][,pattern:=NULL]
    
           id   grp
       <char> <int>
    1:      a     1
    2:      b     2
    3:      c     1
    

    Prior solution:

    dt[dt[, .(paste0(sort(c(value1,value2)), collapse=",")), by=id] %>% 
         .[,pattern:=.GRP, by=V1] %>% 
         .[,V1:=NULL], on=.(id)]
    

    Output:

           id value1 value2 pattern
       <char>  <num>  <num>   <int>
    1:      a      1      0       1
    2:      a      1      3       1
    3:      b      1      0       2
    4:      b      2      3       2
    5:      c      1      0       1
    6:      c      1      3       1