Search code examples
algorithmrandomreservoir-sampling

Proof of stream reservoir sampling


I'm quite familiar with Reservoir Sampling algorithm and I'm thinking what if the total size N is given. What benefit can we get under this situation? As a result, here's the algorithm:

Let n be the total size of data
Let k be the total size of sample
for each element from data
    if random(0,1) <= k/n
        put this element into sample
        -- k
    -- n
done

It looks correct at the first glance but I found it hard to prove. Anybody can help me to prove this algorithm in a formal way?


Solution

  • Here's a correctness proof for Dave's fix. Assume without loss of generality that the stream is 1..n. We prove inductively that, after m in {0..n} iterations through the loop, the sample is distributed as the intersection with 1..m of a uniform random k-combination of 1..n.

    The base case, m = 0, is trivial: both the sample and the intersection always are empty. Given the inductive hypothesis for a particular m, we now prove it for m+1. Let S be a random variable representing the set after m iterations and let S' be a random variable representing the set after m+1 iterations. Let & be set intersection. For all k-combinations T, we write

    Pr(S' = T & {1..m+1})
        = Pr(S = T & {1..m}) Pr(S' = T & {1..m+1} | S = T & {1..m}),
    

    since S' = T & {1..m+1} implies that S = T & {1..m}. By the inductive hypothesis and some counting,

    (n choose k) Pr(S = T & {1..m})
        = ((n - m) choose (k - |T & {1..m}|)).
    

    Combining the two equations, we get

    (n choose k) Pr(S' = T & {1..m+1})
        = ((n - m) choose (k - |T & {1..m}|)) Pr(S' = T & {1..m+1} | S = T & {1..m}).
    

    By inspecting Dave's program,

    Pr(m+1 in S' | S) = (k - |S|) / (n - m).
    

    Now there are two cases. The first case is m+1 in T.

    Pr(S' = T & {1..m+1} | S = T & {1..m})
        = Pr(m+1 in S' | S = T & {1..m})
        = (k - |T & {1..m}|) / (n - m)
    (n choose k) Pr(S' = T & {1..m+1})
        = ((n - m) choose (k - |T & {1..m}|)) (k - |T & {1..m}|) / (n - m)
        = (n - m - 1) choose (k - |T & {1..m}| - 1)
        = (n - (m+1)) choose (k - |T & {1..m+1}|).
    

    The second case is m+1 not in T.

    Pr(S' = T & {1..m+1} | S = T & {1..m})
        = Pr(m+1 not in S' | S = T & {1..m})
        = (n - m - (k - |T & {1..m}|)) / (n - m)
    (n choose k) Pr(S' = T & {1..m+1})
        = ((n - m) choose (k - |T & {1..m}|)) (n - m - (k - |T & {1..m}|)) / (n - m)
        = (n - m - 1) choose (k - |T & {1..m}|)
        = (n - (m+1)) choose (k - |T & {1..m+1}|).
    

    In both cases, we prove that Pr(S' = T & {1..m+1}) has the correct value.