Search code examples
algorithmmathprobabilityinequality

Estimating set cardinality by querying


I have the following problem for homework and I was wondering if you could provide some hints.


Your friend has in mind a set of numbers S ⊆ {1, ... , n}. If you ask him about a set Q ⊆ {1, ..., n} he will answer whether the intersection S ∩ Q is empty or not. Your goal is to estimate (approximately) how many elements are contained in the set S.

Design a strategy to distinguish whether S contains ≤ k or ≥ (1 + ε)k for any k ∈ {1, . . , n}. Show that O (log(1/δ)/ε^2) queries are sufficient to distinguish the 2 cases with probability 1 - δ.


I think that by choosing a random set Q containing every element i ∈ {1, ..., n} with probability 1/k, then if S had at least k elements then the expected number of elements both in S and Q would be |S|/k. Then, if the intersection is empty we could say that (maybe with a good probability, I am not sure) that |S| < k. I think that I should run this experiment t times and have t random variables Xi, where Xi = 1 if the intersection is not empty, 0 otherwise. If the sum of Xi is at least t/2 then we could say that with some probability |S| < k. I am not sure if this is correct and which is the best approach to compute the probability. Moreover, I am not sure how to distinguish whether S contains ≤ k or ≥ (1 + ε)k elements.

How do you think that I should proceed?


Solution

  • I will only get you started.

    Suppose that your guess has a probability p of intersecting with your friend's set. Then "did I intersect" is a random variable whose average value is p, and whose variance is p * (1 - p).

    If your guesses are independent from each other, then after m guesses the number that intersected will be a random variable with mean p*m and variance m*p*(1-p). If m is large, then by the Central Limit Theorem this random variable will follow approximately a normal distribution. But we do not need such a strong result. Chebyshev's Inequality gives us a bound for all m.

    Now make your random sets such that if your friend's set had (1 + ε/2)k elements, then there would be even odds of getting a match. If there are at most k elements then the probability of a match are less than 0.5 by an amount you can calculate. If there are (1 + ε)k or more elements then the probability of a match will be more than 0.5 by an amount you can calculate. So you'll just guess m times, count the matches, and compare to m/2 to make your final conclusion.

    The bounds that you get will not be tight, but they should wind up good enough to be able to answer your homework question.