I came across a Q that was asked in one of the interviews..
Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for -
This may be a vague question.. I tried to google "randomly sample data set" but did not find any relevant results.
Binary sample/don't sample may not be the right answer.. suppose you want to sample 1000 strings and you do it via coin toss.. This would mean that approximately after visiting 2000 strings.. you will be done.. What about the rest of the strings?
I read this post - http://gregable.com/2007/10/reservoir-sampling.html
which answers this Q quite clearly..
Let me put the summary here -
SIMPLE SOLUTION
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
RESERVOIR SAMPLING
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001. If you choose to add it, then replace any element (say element #2) in the reservoir chosen randomly. The element #2 is definitely in the reservoir at step 1,000 and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - that or 1,000/1,001.
This can be extended for the i'th round - keep the i'th element with probability 1,000/i and if you choose to keep it, replace a random element from the reservoir. The probability any element before this step being in the reservoir is 1,000/(i-1). The probability that they are removed is 1,000/i * 1/1,000 = 1/i. The probability that each element sticks around given that they are already in the reservoir is (i-1)/i and thus the elements' overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.