It is easy to generate a random subset of the powerset if we are able to compute all elements of the powerset first and then randomly draw a sample out of it:
set.seed(12)
x = 1:4
n.samples = 3
library(HapEstXXR)
power.set = HapEstXXR::powerset(x)
sample(power.set, size = n.samples, replace = FALSE)
# [[1]]
# [1] 2
#
# [[2]]
# [1] 3 4
#
# [[3]]
# [1] 1 3 4
However, if the length of x
is large, there will be too many elements for the powerset. I am therefore looking for a way to directly compute a random subset.
One possibility is to first draw a "random length" and then draw random subset of x
using the "random length":
len = sample(1:length(x), size = n.samples, replace = TRUE)
len
# [1] 2 1 1
lapply(len, function(l) sort(sample(x, size = l)))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 1
This, however, generates duplicates. Of course, I could now remove the duplicates and repeat the previous sampling using a while
loop until I end up with n.samples
non-duplicate random subsets of the powerset:
drawSubsetOfPowerset = function(x, n) {
ret = list()
while(length(ret) < n) {
# draw a "random length" with some meaningful prob to reduce number of loops
len = sample(0:n, size = n, replace = TRUE, prob = choose(n, 0:n)/2^n)
# draw random subset of x using the "random length" and sort it to better identify duplicates
random.subset = lapply(len, function(l) sort(sample(x, size = l)))
# remove duplicates
ret = unique(c(ret, random.subset))
}
return(ret)
}
drawSubsetOfPowerset(x, n.samples)
Of course, I could now try to optimize several components of my drawSubsetOfPowerset
function, e.g. (1) trying to avoid the copying of the object ret
in each iteration of the loop, (2) using a faster sort, (3) using a faster way to remove duplicates of the list, ...
My question is: Is there maybe a different way (which is more efficient) of doing this?
How about using binary representation? This way we can generate a random subset of integers from the length of the total number of power sets given by 2^length(v)
. From there we can make use of intToBits
along with indexing to guarantee we generate random unique subsets of the power set in an ordered fashion.
randomSubsetOfPowSet <- function(v, n, mySeed) {
set.seed(mySeed)
lapply(sample(2^length(v), n) - 1, function(x) v[intToBits(x) > 0])
}
Taking x = 1:4
, n.samples = 5
, and a random seed of 42, we have:
randomSubsetOfPowSet(1:4, 5, 42)
[[1]]
[1] 2 3 4
[[2]]
[1] 1 2 3 4
[[3]]
[1] 3
[[4]]
[1] 2 4
[[5]]
[1] 1 2 3
It turns out that given a set, we can find all subsets by turning to bits (yes, 0s and 1s). By viewing the elements in a subset as on
elements in the original set and the elements not in that subset as off
, we now have a very tangible way of thinking about how to generate each subset. Observe:
Original set: {a, b, c, d}
| | | |
V V V V b & d
Existence in subset: 1/0 1/0 1/0 1/0 are on
/ \
/ \
| |
V V
Example subset: {b, d} gets mapped to {0, 1, 0, 1}
| \ \ \_______
| | \__ \
| |___ \____ \____
| | | |
V V V V
Thus, {b, d} is mapped to the integer 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3 = 10
This is now a problem of combinations of bits of length n. If you map this out for every subset of A = {a, b, c, d}
, you will obtain 0:15
. Therefore, to obtain a random subset of the power set of A
, we simply generate a random subset of 0:15
and map each integer to a subset of A
. How might we do this?
sample
comes to mind.
Now, it is very easy to go the other way as well (i.e. from an integer to a subset of our original set)
Observe:
Given the integer 10 and set A given above (i.e. {a, b, c, d}) we have:
10 in bits is -->> {0, 1, 0, 1}
Which indices are greater than 0?
Answer: 2 and 4
Taking the 2nd the 4th element of our set gives: {b, d} et Voila!