Search code examples
rperformance

How to best combine unique and match in R?


I found myself often writing code such as

#' @param x input vector
#' @param ... passed to [slow_fun()]
fast_fun <- function(x, ...) {
  u <- unique(x)
  i <- match(x, u)
  v <- slow_fun(u, ...)
  v[i]
}

To accelerate a slow vectorized "pure" function where each input entry could theoretically be computed individually and where input is expected to contain many duplicates.

Now I wonder whether this is the best way to achieve such a speedup or is there some function (preferrably in base R or the tidyverse) which does something like unique and match at the same time?


Benchmarks so far

Thanks for the provided answers. I've written a small benchmark suite to compare the approaches:

method <- list(
  brute = slow_fun,
  unique_match = function(x, ...) {
    u <- unique(x)
    i <- match(x, u)
    v <- slow_fun(u, ...)
    v[i]
  },
  unique_factor = function(x, ...) {
    if (is.character(x)) {
      x <- factor(x)
      i <- as.integer(x)
      u <- levels(x)
    } else {
      u <- unique(x)
      i <- as.integer(factor(x, levels = u))
    }
    v <- slow_fun(u, ...)
    v[i]
  },
  unique_match_df = function(x, ...) {
    u <- unique(x)
    i <- if (is.numeric(x)) {
      match(data.frame(t(round(x, 10))), data.frame(t(round(u, 10))))
    } else {
      match(data.frame(t(x)), data.frame(t(u)))
    }
    v <- slow_fun(u, ...)
    v[i]
  },
  rcpp_uniquify = function(x, ...) {
    iu <- uniquify(x)
    v <- slow_fun(iu[["u"]], ...)
    v[iu[["i"]]]
  }
)

exprs <- lapply(method, function(fun) substitute(fun(x), list(fun = fun)))

settings$bench <- lapply(seq_len(nrow(settings)), function(i) {
  cat("\rBenchmark ", i, " / ", nrow(settings), sep = "")
  x <- switch(
    settings$type[i],
    integer = sample.int(
      n = settings$n_distinct[i],
      size = settings$n_total[i],
      replace = TRUE
    ),
    double = sample(
      x = runif(n = settings$n_distinct[i]),
      size = settings$n_total[i],
      replace = TRUE
    ),
    character = sample(
      x = stringi::stri_rand_strings(
        n = settings$n_distinct[i],
        length = 20L
      ),
      size = settings$n_total[i],
      replace = TRUE
    )
  )
  microbenchmark::microbenchmark(
    list = exprs
  )
})

library(tidyverse)
settings %>%
  mutate(
    bench = map(bench, summary)
  ) %>%
  unnest(bench) %>%
  group_by(n_distinct, n_total, type) %>%
  mutate(score = median / min(median)) %>%
  group_by(expr) %>%
  summarise(mean_score = mean(score)) %>%
  arrange(mean_score)

Currently, the rcpp-based approach is best in all tested settings on my machine but barely manages to exceed the unique-then-match method. I suspect a greater advantage in performance the longer x becomes, because unique-then-match needs two passes over the data while uniquify() only needs one pass.

|expr            | mean_score|
|:---------------|----------:|
|rcpp_uniquify   |   1.018550|
|unique_match    |   1.027154|
|unique_factor   |   5.024102|
|unique_match_df |  36.613970|
|brute           |  45.106015|

Solution

  • I've finally managed to beat unique() and match() using Rcpp to hand-code the algorithm in C++ using a std::unordered_map as core bookkeeping data structure.

    Here is the source code, which can be used in R by writing it into a file and running Rcpp::sourceCpp on it.

    #include <Rcpp.h>
    using namespace Rcpp;
    
    template <int T>
    List uniquify_impl(Vector<T> x) {
      IntegerVector idxes(x.length());
      typedef typename Rcpp::traits::storage_type<T>::type storage_t;
      std::unordered_map<storage_t, int> unique_map;
      int n_unique = 0;
      // 1. Pass through x once
      for (int i = 0; i < x.length(); i++) {
        storage_t curr = x[i];
        int idx = unique_map[curr];
        if (idx == 0) {
          unique_map[curr] = ++n_unique;
          idx = n_unique;
        }
        idxes[i] = idx;
      }
      // 2. Sort unique_map by its key
      Vector<T> uniques(unique_map.size());
      for (auto &pair : unique_map) {
        uniques[pair.second - 1] = pair.first;
      }
      
      return List::create(
        _["u"] = uniques,
        _["i"] = idxes
      );
    }
    
    // [[Rcpp::export]]
    List uniquify(RObject x) {
      switch (TYPEOF(x)) {
      case INTSXP: {
        return uniquify_impl(as<IntegerVector>(x));
      }
      case REALSXP: {
        return uniquify_impl(as<NumericVector>(x));
      }
      case STRSXP: {
        return uniquify_impl(as<CharacterVector>(x));
      }
      default: {
        warning(
          "Invalid SEXPTYPE %d (%s).\n",
          TYPEOF(x), type2name(x)
        );
        return R_NilValue;
      }
      }
    }