Search code examples
rvectoruniquepurrr

return unique combinations of items R?


Given multiple vectors I'd like to return the unique combinations of columns across the vectors. Here is an example that works but isn't as fast as computationally fast as I'd like:

library(dplyr)

c_sort_collapse <- function(...){
  c(...) %>% 
    sort() %>% 
    paste(collapse = ".")
}

unique_set <- function(...){
  list(...) %>% 
    purrr::pmap_chr(c_sort_collapse) %>% 
    unique()
}

unique_set(c("a", "b", "a"), c("a", "a", "b"))
#> [1] "a.a" "a.b"

Is there a way to do this that is faster / better vectorized, i.e. doesn't depend on looping through each item (like happens in the purrr::pmap_chr() step)?


Solution

  • This solution only works when there is two vectors to combine, but is pretty fast. I took the liberty of giving other methods the advantage of stringi::stri_sort() which already is an order of magnitude faster.

    library(dplyr)
    library(stringi)
    
    set.seed(123)
    x <- sample(letters, 1000, replace = TRUE)
    
    set.seed(12)
    y <- sample(letters, 1000, replace = TRUE)
    
    c_sort_collapse <- function(...){
      c(...) |> 
        stri_sort() |> 
        paste(collapse = ".")
    }
    
    unique_set <- function(...){
      list(...) |> 
        purrr::pmap_chr(c_sort_collapse) |> 
        unique()
    }
    
    unique_set_matrix <- function(...){
      matrix(c(...), nrow = length(list(...)), byrow = TRUE) |>
        apply(2, stri_sort) |>
        asplit(2) |>
        unique() |>
        sapply(paste, collapse = ".")
    }
    
    pminmax <- function(x, y) {
      paste(pmin.int(x, y), pmax.int(x, y), sep = ".") |> unique()
    }
    
    all.equal(sort(unique_set(x, y)), sort(pminmax(x, y)))
    #> [1] TRUE
    
    bench::mark(
      tidy = unique_set(x, y),
      matrix = unique_set_matrix(x, y),
      Map = Map(\(x,y) paste0(stri_sort(c(x,y)) , collapse = ".") , x , y) |>
        unique() |> unlist(FALSE, FALSE),
      pminmax = pminmax(x, y),
      iterations = 20, check = FALSE
    )
    #> # A tibble: 4 × 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 tidy         6.13ms   6.24ms      160.    45.4KB     17.8
    #> 2 matrix       5.18ms   5.55ms      168.   229.8KB     29.7
    #> 3 Map           5.7ms   6.83ms      151.    33.4KB     16.7
    #> 4 pminmax     484.6µs 487.85µs     2035.    49.6KB      0
    

    Created on 2022-08-05 by the reprex package (v2.0.1)