Search code examples
algorithmmathrandomsampling

How to store a sample of a sequentially presented collection of unknown size?


Suppose I want to store N samples (each sample takes up a significant portion of memory), that should form a representative collection of a total of M>>N samples that are presented to me sequentially. I do not know M in advance, and can only keep N samples in memory simultaneously.

Here, representative, means that each out of the M samples should have equal probability to be stored.


Solution

  • This problem is known as reservoir sampling and there is a very efficient O(M)-time, O(N)-space algorithm for it. The algorithm works as follows: at each point, maintain a "guess" of which N elements you want to choose. Initially, choose the first N elements. Then, after seeing the kth element of the sequence, choose a random number between 1 and k, inclusive. If the number chosen is in the range 1..N, replace the indexed "guessed" item with the current item; otherwise do nothing. You can show using a quick inductive argument that this will sample N elements uniformly at random and with one pass over the data.

    Hope this helps!