I have asked this question previously (see here) and received a satisfactory answer using the purr
package. However, this has proved to be a bottle neck in my program so I would like to rewrite the section using the RCPP
package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
map2_lgl(.x = possibilities,
.y = seq_along(possibilities),
~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The notion here is to avoid the O(N^3) and use a less order instead. The other answer provided here will be slow still since it is greater than O(N^2). Here is a solution with less than O(N^2), where the worst case scenario is O(N^2) when all the elements are unique.
onlySet <- function(x){
i <- 1
repeat{
y <- sapply(x[-1], function(el)!all(is.element(x[[1]], el)))
if(all(y)){
if(i==length(x)) break
else i <- i+1
}
x <- c(x[-1][y], x[1])
}
x
}
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
mat <- outer(a, a, match_fun)
a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a