algorithm stream cardinality-estimation flajolet-martin

What is the intuition behind the Flajolet-Martin algorithm?

I have tried to understand why Flajolet-Martin Algorithm (FM) works for too long. The description of the algorithm here (section 4.4.2) is promising but not perfect.

Why does the the maximum tail length (# of trailing zeros) of any element work as an estimate for number of distinct elements in stream? Imagine having only two distinct elements {1,2} and they hash respectively to {10001, 10000}. This would imply that the number of distinct elements were 2^4, which obviously is not correct.

What's the trick?

Solution

Let's start with a simple question: if you flipped a fair coin three times, what's the probability that you get three consecutive tails? That would be 1/8, since each coin has a 50/50 chance of coming up tails.

Now we can ask - if you were to repeatedly flip a coin three times, roughly how many groups of three flips would you need to have one of them get three tails in a row? Well, since there's a 1/8 chance of getting three tails, you'd expect that you'd need to do this eight times. In fact, that's exactly the expected number of times you'd need to do this.

More generally, how many times would you need to flip a series of k coins before you'd expect to see k consecutive tails? It's probably about 2^k times, since there's a 1/2^k chance of getting k consecutive tails.

Now, suppose someone comes to you and says "hey! I flipped a coin ten times in a row and got ten consecutive tails." You'd be a bit suspicious of this claim if you think the person just tried flipping ten coins once, since with one attempt you have a roughly 1/1000 chance of getting ten consecutive tails. But if you imagine this person tried flipping ten coins in a row over and over and over again, now this is a bit more plausible. You might say back something like "wow! You probably had to flip those coins like, what, 2¹⁰ times?" And while you may be way off - maybe they just got really lucky - you'd still probably have a pretty good estimate of how many coin flip trials they had to do.

Thanks for indulging this little departure. Let's get back to Flajolet-Martin. :-)

The Flajolet-Martin estimator works by hashing elements and keeping track of the number of 0 bits that appear at the end of each of the hashes. Think of the hashes not as numbers, but as sequences of 0s and 1s that encode a series of coin flips. For example, the hash 0110 would be interpreted as "tails, heads, heads, tails."

In this model, the idea of "count how many trailing zeros there are" ends up being essentially equivalent to "count how many consecutive tails were flipped." And using the reasoning from above, it's very unlikely that you're going to see a large run of tails, so if you do see a lot of tails in a row, it probably means you've seen a lot of items.

Of course, as you noted, this isn't perfect, and you could get unlucky by having a hash code with a huge run of consecutive zeros at the back even though you've only seen a small number of items. That's what happened in your above example. To counteract this, you can run several copies of Flajolet-Martin in parallel and aggregating the results together so that a single bad estimate doesn't spoil the overall result. (This, plus a bit more, gives you the famous HyperLogLog estimator!)

Hope this helps!