Search code examples
rdata.tableaggregatefrequencycategorical-data

Aggregate a categorical data.table column based on frequency of occurrence in one step in R


I got a data.table DT with millions of rows and quite a few columns. I'd like to aggregate the data.table on various columns at the same time. One column 'Var' is a categorical variable and I want to aggregate it in a way that the entry with the most occurrence is chosen.

> require(data.table)
> DT <- data.table(ID = c(1,1,1,1,2,2,2,3,3), Var = c('A', 'B', 'B', 'B', 'C', 'C', 'A', 'A', 'A'))
> DT
   ID Var
1:  1   A
2:  1   B
3:  1   B
4:  1   B
5:  2   C
6:  2   C
7:  2   A
8:  3   A
9:  3   A

My desired output is:

> desired_output
   ID agg_Var
1:  1       B    # B occurred the most for ID = 1
2:  2       C    # C occurred the most for ID = 2
3:  3       A    # A occurred the most for ID = 3

I know i can do this in two steps. First by aggregating the numbers of occurrence for each ID and Var, then choosing the row with maximum frequency:

> ## I know this works but it involves more than one step:
> step1 <- DT[,.( freq = .N), by=.(ID, Var)]
> step1 
   ID Var freq
1:  1   A    1
2:  1   B    3
3:  2   C    2
4:  2   A    1
5:  3   A    2
> step2 <- step1[, .(Var_agg = Var[which.max(freq)]), by = .(ID)]
> step2
   ID Var_agg
1:  1       B
2:  2       C
3:  3       A

I'm looking for a way to do this in one step if possible? The reason is that I have quite a few other aggregations i need to do for this table but the other aggregations all involve one step and it would be great if I didn't have to do a separate aggregation for this column, so that I could just include it with the aggregation of other columns. This problem is a code optimisation issue. I'm only interested in data.table operations, not additional packages.


Solution

  • Create a function for calculation of Mode and do a group by Mode

    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    
    DT[, .(agg_Var = Mode(Var)), ID]