Search code examples
algorithmuniqueprobabilitycardinalityhyperloglog

Simple cardinality estimation algorithm


There's the HyperLogLog algorithm, but it is quite complex.

Is there any simpler space efficient approach that could be expressed in couple of lines of code?


Solution

  • HyperLogLog itself is quite simple, once you understand it (and already have a hash function).

    I have implemented a didactic version in Python with 5 lines for element insertion and another 6 lines for estimation:

    from mmhash import mmhash
    from math import log
    from base64 import b64encode
    
    class HyperLogLog:
        def __init__(self, log2m):
            self.log2m = log2m
            self.m = 1 << log2m
            self.data = [0]*self.m
            self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m
    
        def offer(self, o):
            x = mmhash(str(o), 0)
            a, b = 32-self.log2m, self.log2m
            i = x >> a
            v = self._bitscan(x << b, a)
            self.data[i] = max(self.data[i], v)
    
        def count(self):
            estimate = self.alphaMM / sum([2**-v for v in self.data])
            if estimate <= 2.5 * self.m:
                zeros = float(self.data.count(0))
                return round(-self.m * log(zeros / self.m))
            else:
                return round(estimate)
    
        def _bitscan(self, x, m):
            v = 1
            while v<=m and not x&0x80000000:
                v+=1
                x<<=1
            return v
    

    The full implementation can be found here:

    https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py