Search code examples
c++rrcpp

Rcpp fast statistical mode function with vector input of any type


I'm trying to build a super fast mode function for R to use for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read This post, This Help-page and others, but I was not able to make the function take in all R data types. My code now works for numeric vectors, I am relying on Rcpp sugar wrapper functions:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false) 
{
    if (narm) x = x[!is_na(x)];
    NumericVector ux = unique(x);
    int y = ux[which_max(table(match(x, ux)))];
    return y;
}

In addition I was wondering if the 'narm' argument can be renamed 'na.rm' without giving errors, and of course if there is a faster way to code a mode function in C++, I would be grateful to know about it.


Solution

  • In order to make the function work for any vector input, you could implement @JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x)). But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.

    Putting it all together:

    #include <Rcpp.h>
    using namespace Rcpp ;
    
    // [[Rcpp::plugins(cpp11)]]
    #include <unordered_map>
    
    template <int RTYPE>
    Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
      if (narm) x = x[!is_na(x)];
      int myMax = 1;
      Vector<RTYPE> myMode(1);
      // special case for factors == INTSXP with "class" and "levels" attribute
      if (x.hasAttribute("levels")){
        myMode.attr("class") = x.attr("class");
        myMode.attr("levels") = x.attr("levels");
      }
      std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
      modeMap.reserve(x.size());
    
      for (std::size_t i = 0, len = x.size(); i < len; ++i) {
        auto it = modeMap.find(x[i]);
    
        if (it != modeMap.end()) {
          ++(it->second);
          if (it->second > myMax) {
            myMax = it->second;
            myMode[0] = x[i];
          }
        } else {
          modeMap.insert({x[i], 1});
        }
      }
    
      return myMode;
    }
    
    template <>
    Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
      stop("Not supported SEXP type!");
    }
    
    // [[Rcpp::export]]
    SEXP fastMode( SEXP x, bool narm = false ){
      RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
    }
    
    /*** R
    set.seed(1234)
    s <- sample(1e5, replace = TRUE)
    fastMode(s)
    fastMode(s + 0.1)
    l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
    fastMode(l)
    c <- sample(letters, 1e5, replace = TRUE)
    fastMode(c)
    f <- as.factor(c)
    fastMode(f) 
    */
    

    Output:

    > set.seed(1234)
    
    > s <- sample(1e5, replace = TRUE)
    
    > fastMode(s)
    [1] 85433
    
    > fastMode(s + 0.1)
    [1] 85433.1
    
    > l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
    
    > fastMode(l)
    [1] TRUE
    
    > c <- sample(letters, 1e5, replace = TRUE)
    
    > fastMode(c)
    [1] "z"
    
    > f <- as.factor(c)
    
    > fastMode(f) 
    [1] z
    Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
    

    As noted above, the used algorithm comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.