My goal is to sample k integers from 0, ..., n-1 without duplication. The order of the sampled integers doesn't matter. At each call (which occurs very often), n and k vary slightly, but not by much (n is about 250,000 and k is about 2,000). I've come up with the following amortized O(k) algorithm:
I think uniform sampling of permutations/combinations should be an exhaustively studied problem, so either (1) there is a much better solution, or at least (2) my solution is a (minor modification of) a well-known one. Either way, please help me. Thanks.
If k is much less than n -- say, less than half of n -- then the most efficient solution is to keep the numbers already generated in a hash table (actually, a hash set, since there is no value associated with a key). If a random number happens to already be in the hash table, reject it and generate another one in its place. With the actual values of k and n suggested (k ∼ 2,000; n ∼ 250,000), the expected number of rejections needed to generate k unique samples is less than 10, so the overhead will hardly be noticeable. The size of the hash table is O(k), and it can simply be deleted at the end of the sample generation.
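To make the rejection approach concrete, here is a minimal Python sketch; the function name, the use of Python's built-in set, and the rng parameter are my own illustrative choices, not part of the original question:

```python
import random

def sample_by_rejection(n, k, rng=random):
    """Sample k distinct integers from range(n) by rejecting duplicates.

    The expected number of rejected draws is roughly k*k / (2*(n - k))
    when k is much smaller than n, i.e. under 10 for k ~ 2,000 and
    n ~ 250,000, so the loop is effectively O(k).
    """
    chosen = set()
    while len(chosen) < k:
        r = rng.randrange(n)   # may hit an already-chosen value; just retry
        chosen.add(r)          # adding a duplicate to a set is a no-op
    return list(chosen)
```

The set is rebuilt from scratch on every call, which matches the note above that the hash table can simply be deleted once the sample has been generated.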
It is also possible to simulate the Fisher-Yates (Knuth) shuffle algorithm using a hash table instead of a vector of n values, thereby avoiding having to reject generated random numbers. If you were using a vector A, you would start by initializing A[i] to i, for every 0 ≤ i < n. With the hash table H, you start with an empty hash table, and use the convention that H[i] is considered to be i if the key i is not in the hash table. Step 3 in your algorithm -- "swap A[r] with A[i]" -- becomes "add H[r] as the next element of the sample and set H[r] to H[i]". Note that it is unnecessary to set H[i], because that element will never be referred to again: all subsequent random numbers r are generated from a range which does not include i.
Because the hash table in this case contains both keys and values, it is larger than the hash set used in alternative 1, above, and the increased size (and the consequent increase in memory cache misses) is likely to cause more overhead than is saved by eliminating rejections. However, it has the advantage of working even if k is occasionally close to n.
Finally, in your proposed algorithm, it is actually quite easy to restore A in O(k) time. A value A[j] will have been modified by the algorithm only if:

a. n − k ≤ j < n, or
b. there is some i such that n − k ≤ i < n and A[i] = j.

Consequently, you can restore the vector A by looking at each A[i] for n − k ≤ i < n: first, if A[i] < n − k, set A[A[i]] to A[i]; then, unconditionally set A[i] to i.