Something that has been bugging me about the HyperLogLog algorithm is its reliance on the hash of the keys. The issue I have is that the paper seems to assume a totally random distribution of data on each partition; however, in the context where it is often used (MapReduce-style jobs), data is often distributed by hash value, so all duplicate keys land on the same partition. To me this means we should actually be adding the cardinalities generated by HyperLogLog rather than using some sort of averaging technique (in the case where we are partitioned by hashing the same thing that HyperLogLog hashes).
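To illustrate the premise: a sketch (with an assumed hash function and partition count) showing that hash partitioning makes the per-partition distinct sets disjoint, so their counts sum exactly:

```python
import hashlib

def h(x):
    # toy 64-bit hash for illustration; any good hash works
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

P = 8  # number of partitions (an assumption for the sketch)
data = [i % 1000 for i in range(10000)]  # many duplicate keys, 1000 distinct

# partition by hash: every copy of a given key lands on the same partition
parts = [set() for _ in range(P)]
for k in data:
    parts[h(k) % P].add(k)

# the per-partition distinct sets are disjoint, so the counts sum exactly
print(sum(len(p) for p in parts))  # 1000
```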
So my question is: is this a real issue with HyperLogLog, or have I not read the paper carefully enough?
This is a real issue if you use non-independent hash functions for both tasks.
Let's say the partitioner assigns a node based on the first b bits of the hashed value. If you use the same hash function for both partitioning and HyperLogLog, the algorithm will still work properly, but precision will be sacrificed. In practice, it is equivalent to using m/2^b buckets (log2(m') = log2(m) - b): within a partition the first b bits are always the same, so only the remaining log2(m) - b bits actually vary when choosing the HLL bucket.
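The effect above can be verified with a short sketch (hash function, b, and log2(m) are assumptions for illustration): after partitioning by the top b bits of the same hash, those bits are constant within each partition, leaving only log2(m) - b useful bits for bucket selection.

```python
import hashlib

def h(x):
    # toy 64-bit hash for illustration; any good hash works
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

B = 4       # partition bits (assumption)
LOG2M = 10  # HLL bucket bits, i.e. m = 1024 (assumption)

# partition keys by the top B bits of the same hash HLL would use
partitions = {}
for key in range(100000):
    hv = h(key)
    part = hv >> (64 - B)
    partitions.setdefault(part, []).append(hv)

# within any one partition, the top B bits of every hash are identical,
# so only LOG2M - B bits actually vary when picking an HLL bucket
for part, hashes in partitions.items():
    assert all(hv >> (64 - B) == part for hv in hashes)

# effective buckets per node: 2**(LOG2M - B) instead of 2**LOG2M
print(2 ** LOG2M, 2 ** (LOG2M - B))  # 1024 64
</imports>
</imports>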