Given multiple vectors I'd like to return the unique combinations of columns across the vectors. Here is an example that works but isn't as fast as computationally fast as I'd like:
library(dplyr)
c_sort_collapse <- function(...){
c(...) %>%
sort() %>%
paste(collapse = ".")
}
unique_set <- function(...){
list(...) %>%
purrr::pmap_chr(c_sort_collapse) %>%
unique()
}
unique_set(c("a", "b", "a"), c("a", "a", "b"))
#> [1] "a.a" "a.b"
Is there a way to do this that is faster / better vectorized, i.e. doesn't depend on looping through each item (like happens in the purrr::pmap_chr()
step)?
This solution only works when there is two vectors to combine, but is pretty fast. I took the liberty of giving other methods the advantage of stringi::stri_sort()
which already is an order of magnitude faster.
library(dplyr)
library(stringi)
set.seed(123)
x <- sample(letters, 1000, replace = TRUE)
set.seed(12)
y <- sample(letters, 1000, replace = TRUE)
c_sort_collapse <- function(...){
c(...) |>
stri_sort() |>
paste(collapse = ".")
}
unique_set <- function(...){
list(...) |>
purrr::pmap_chr(c_sort_collapse) |>
unique()
}
unique_set_matrix <- function(...){
matrix(c(...), nrow = length(list(...)), byrow = TRUE) |>
apply(2, stri_sort) |>
asplit(2) |>
unique() |>
sapply(paste, collapse = ".")
}
pminmax <- function(x, y) {
paste(pmin.int(x, y), pmax.int(x, y), sep = ".") |> unique()
}
all.equal(sort(unique_set(x, y)), sort(pminmax(x, y)))
#> [1] TRUE
bench::mark(
tidy = unique_set(x, y),
matrix = unique_set_matrix(x, y),
Map = Map(\(x,y) paste0(stri_sort(c(x,y)) , collapse = ".") , x , y) |>
unique() |> unlist(FALSE, FALSE),
pminmax = pminmax(x, y),
iterations = 20, check = FALSE
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tidy 6.13ms 6.24ms 160. 45.4KB 17.8
#> 2 matrix 5.18ms 5.55ms 168. 229.8KB 29.7
#> 3 Map 5.7ms 6.83ms 151. 33.4KB 16.7
#> 4 pminmax 484.6µs 487.85µs 2035. 49.6KB 0
Created on 2022-08-05 by the reprex package (v2.0.1)