Tags: performance, sampling, random

Efficient random sampling from a huge list


I have a data file with a large number of values (53,000,000+) and I would like to pull out a random subset of n of these values (say, 2,000,000). I implemented a Perl script that pulls the list into memory, uses the Fisher-Yates method to shuffle the array, and then prints out the first n values in the shuffled list. However, this shuffling process is taking a lot of time, even on much smaller test sets (50,000 values).
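
For reference, a minimal sketch of the approach described above (the file name values.txt and the count n are assumptions for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $n = 2_000_000;

    # Load every value into memory (this is the expensive part).
    open my $fh, '<', 'values.txt' or die "Cannot open values.txt: $!";
    chomp(my @values = <$fh>);
    close $fh;

    # In-place Fisher-Yates shuffle of the whole array.
    for (my $i = $#values; $i > 0; $i--) {
        my $j = int(rand($i + 1));
        @values[$i, $j] = @values[$j, $i];
    }

    # Print the first n values of the shuffled list.
    print "$_\n" for @values[0 .. $n - 1];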

I'm looking for a more efficient, scalable way to identify a random subset of a huge set of values and print it out. Any suggestions?

Update: Based on the answers and some more searching, it looks like the correct terminology is "random sampling".


Solution

  • Elaborating on aix's answer above, to choose k out of a stream of items, read the items one at a time. Keep the first k items in a set S.

    When reading the m-th item I (for m > k), keep it with probability k/m. If you do keep it, select an item U uniformly at random from S and replace U with I.

    The proof that this yields every subset of size k with equal probability is by induction on m. Note that you don't need to know n (the total number of items) in advance, and that at each step S is a uniform random sample of the items read so far. The algorithm is "streaming" - it doesn't require storing all the items or making a second pass. A Perl sketch of this procedure is shown below.
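
    A minimal Perl sketch of this streaming procedure (commonly known as reservoir sampling); the file name values.txt and the value of k are assumptions for illustration:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $k = 2_000_000;
        my @reservoir;          # the set S of at most k kept items
        my $m = 0;              # number of items read so far

        open my $fh, '<', 'values.txt' or die "Cannot open values.txt: $!";
        while (my $item = <$fh>) {
            chomp $item;
            $m++;
            if ($m <= $k) {
                # Keep the first k items unconditionally.
                push @reservoir, $item;
            }
            elsif (rand() < $k / $m) {
                # Keep the m-th item with probability k/m, replacing a
                # uniformly chosen element of the reservoir.
                $reservoir[int(rand($k))] = $item;
            }
        }
        close $fh;

        print "$_\n" for @reservoir;

    This makes a single pass over the file and holds only k items in memory at a time, so it scales to the 53,000,000-value input without shuffling the whole list.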