I have a data file with a large number of values (53,000,000+) and I would like to pull out a random subset of n of these values (say, 2,000,000). I implemented a Perl script that pulls the list into memory, uses the Fisher-Yates method to shuffle the array, and then prints out the first n values in the shuffled list. However, this shuffling process is taking a lot of time, even on much smaller test sets (50,000 values).
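For reference, that approach looks roughly like this (a minimal sketch, not the actual script; it assumes one value per line on STDIN):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp every value into memory, Fisher-Yates shuffle the whole
    # array, then print the first $n values -- all 53M+ values in RAM.
    my $n = 2_000_000;                  # desired sample size
    my @values = <STDIN>;
    chomp @values;

    for (my $i = $#values; $i > 0; $i--) {
        my $j = int(rand($i + 1));                 # uniform index in 0 .. $i
        @values[ $i, $j ] = @values[ $j, $i ];     # swap
    }

    print "$_\n" for @values[ 0 .. $n - 1 ];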
I'm looking for a more efficient, scalable way to identify a random subset of a huge set of values and print it out. Any suggestions?
Update: Based on the answers and some more searching, it looks like the correct terminology is "random sampling".
Elaborating on aix's answer above: to choose k items out of a stream, read the items one at a time. Keep the first k items in a set S.

Now, when reading the m-th item I (with m > k), keep it with probability k/m. If you do keep it, pick an item U uniformly at random from S and replace U with I.

The proof that this yields all subsets of size k with equal probability is by induction on m; the key invariant is that after m items have been read, each of them is in S with probability k/m. Note that you don't need to know n (the total number of items) in advance, and that S at every step is itself a valid uniform sample of the items seen so far. The algorithm is "streaming" - it doesn't require storing all the items or making a second pass.