Search code examples
rdata.tablelapplymodebigdata

R - Fast Mode Function for use in data.table[,lapply(.SD,Mode),by=.()]


I'm summarizing data in a data.table, group by, where I need to take a single value of a variable in a group. I want this value to be the mode of the group. I think it needs to be mode because usually a group is 8 rows and it will have 2 rows at one value and the other 6 or so rows will be another value.

Here's a simplified example, from this:

key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8

I want this:

key1 2

I was having trouble using the standard mode function provided by base R, so I used this solution here: Most frequent value (mode) by group

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

It worked great on my small test data set, but when I run it on my actual data set (22 million rows) it just runs and runs and runs. All my other data.table operations that are similar work great and really fast, but I'm not using a UDF. This is the structure of my data.table query:

ModeCharacterColumns <- ExposureHistory[,lapply(.SD,Mode), .(Key1=Key1, Key2=Key2, ..., key7=key7, key8=key8), .SDcols=('col1','col2','col3', ..., 'col53')]

So I'm guessing my problem is that my UDF is really slowing things down, does anyone have any suggestions where I can accomplish the same goal but get it done much quicker?

Thank you everyone!

EDIT: Better representation of the data:

DT <- fread("key1A key2A key3A key4A 2 2 4 s
             key1A key2A key3A key4A 2 2 4 s  
             key1A key2A key3A key4A 8 8 8 t
             key1A key2A key3A key4A 2 2 4 s
             key1B key2B key3B key4B 6 6 6 v
             key1B key2B key3B key4B 2 2 5 t
             key1B key2B key3B key4B 2 2 5 v
             key1B key2B key3B key4B 2 2 5 v")

And the desired result:

result <- fread("key1A key2A key3A key4A 2 2 4 s
                 key1B key2B key3B key4B 2 2 5 v")

Solution

  • Try using data.table to tabulate the data:

    DT <- fread("key1 8
                 key1 2
                 key1 2
                 key1 8
                 key1 2
                 key1 2
                 key1 2
                 key1 8")
    
    setkeyv(
      DT[, .N, by = .(V1, V2)], #tabulate
      c("V1", "N") #sort by N
       )[, .(Mode = V2[.N]), by = V1] #most frequent value by V1
    #     V1 Mode
    #1: key1    2
    

    You need to consider tie-breaking carefully. I might actually use a for loop to apply this to more value columns, but you'd need to provide a representative reproducible example if you want me to try that.

    Edit:

    Frank provides one option of doing this for several value columns in a comment:

    DT[, lapply(.SD, function(x) setDT(list(x = x))[, .N, by=x][order(-N)][1L, x]), by=V1]
    

    However, I believe this copies every value column, which might slow it down too much.