Search code examples
rdata.tablemode

Row Wise Mode in data.table R


I am trying to find an efficient way to get a row wise modes on a subset of columns in data.table

#Sample data    
a <- data.frame( 
        id=letters[], 
        dattyp1 = sample( 1:2, 26, replace=T) , 
        dattyp2 = sample( 1:2, 26, replace=T) , 
        dattyp3 = sample( 1:2, 26, replace=T) ,
        dattyp4 = sample( 1:2, 26, replace=T) , 
        dattyp5 = sample( 1:2, 26, replace=T) , 
        dattyp6 = sample( 1:2, 26, replace=T)
        )

    library(modeest)
    library(data.table)

I know from: To find "row wise" "Mode" of a given data in R that I can do this:

Mode <- function(x) {
     ux <- unique(x)
          ux[which.max(tabulate(match(x, ux)))]
    }   

apply(a[ ,paste0("dattyp",1:6)], 1, Mode)

But this is really slow (over my millions of records). I am thinking there must be a way to do it with .SDcols - but this does column wise modes not row wise.

a<- data.table( a )
    a[ , lapply(.SD , mfv ), .SDcols=c(paste0("dattyp",1:6) ) ]

Solution

  • I think the fastest way via is still to convert into a relational (i.e. long) format and aggregate and then find max in reldtMtd function as follows. I wonder if using Rcpp will be faster.

    data:

    library(data.table)
    M <- 1e6
    popn <- 2
    set.seed(0L)
    a <- data.frame( 
        id=1:M, 
        dattyp1 = sample(popn, M, replace=TRUE), 
        dattyp2 = sample(popn, M, replace=TRUE), 
        dattyp3 = sample(popn, M, replace=TRUE),
        dattyp4 = sample(popn, M, replace=TRUE), 
        dattyp5 = sample(popn, M, replace=TRUE), 
        dattyp6 = sample(popn, M, replace=TRUE)
    )    
    setDT(a)
    

    methods:

    reldtMtd <- function() {
        melt(a, id.vars="id")[, 
            .N, by=.(id, value)][,
                value[which.max(N)], by=.(id)] 
    }
    
    #from https://stackoverflow.com/a/8189441/1989480
    Mode <- compiler::cmpfun(function(x) {   
        ux <- unique(x)
        ux[which.max(tabulate(match(x, ux)))]
    })
    Mode2 <- compiler::cmpfun(function(x) names(which.max(table(x))))
    matA <- as.matrix(a[, -1L])
    
    baseMtd1 <- function() apply(matA, 1, Mode)
    baseMtd2 <- function() apply(matA, 1, Mode2)
    
    library(microbenchmark)
    microbenchmark(reldtMtd(), baseMtd1(), baseMtd2(), times=3L)
    

    timings:

    Unit: seconds
           expr        min         lq       mean     median         uq       max neval
     reldtMtd()   1.882783   1.947515   2.031767   2.012248   2.106259   2.20027     3
     baseMtd1()  15.618716  15.675314  15.809277  15.731913  15.904557  16.07720     3
     baseMtd2() 160.837513 161.692634 162.455048 162.547755 163.263816 163.97988     3