Search code examples
phphashmemcachedmd5

Where can I find xxhash64 and md5 collision probability statistics?


I dont find any info about % of collisions for xxhash64.

I'm going to use it for cache system (to generate hash keys which need to be unique, about a hundreds millions). Now i use md5, but i don't need cryptographic property.

So i need some info, to decide does is it a good decision for my task. In best case - comparison of the number of collisions between md5 and xxHash64.


Solution

  • You can calculate yourself by using the birthday problem.

    In general the mathematical expression that gives you the probability of hash function is :

    p(k) = 1 - exp(-k(k-1)/2N, k (number of hashes) randomly generated values, where each value is a non-negative integer less than N (number of possible hashes):

    N = 2^(number of bit), example for md5 it is 2^128, or 2^32 for 32 bit-hash

    If you use md5

    will produce a 128-bit hash value, by applying this formula you get this 'S' graph. This graph explains, for example, in order to get a collison probability of 50% (0.5), you need at least 21 000 000 trillion of hashes or 21 quintillion of hashes!!!! If you we use less than, for instance 1 billion of hashes, the probability of collision is negligible.

    enter image description here

    If you are using hundred millions of hashed keys, the probability of collision is 0% using md5.

    If you use xxhash64,

    Assuming that xxhash64 produce a 64-bit hash. You will get this graph. enter image description here

    According to this picture, you can see that if the collision percentage is 50%, you need at least 5 billion of hashes. Two of the 5 billion of hashes can have an odd of 1/2 to have the same hashes!!! If you have around 12 billion of hashes there is 100% of chance that the hashes collide.

    If you are using hundred millions of hashed keys, the probability of collision is 0.033% using xxhash64.

    This link explains why md5 or fast hash method are not secure.