Search code examples
mathprobabilitysha1hash-collision

Chance of a duplicate hash when using first 8 characters of SHA1


If I have an index of URLs, and ID them by the first 8 characters of a SHA1 hash, what is the probability of two different URLs having identical IDs?


Solution

  • @Teepeemm has correctly answered the related question ‘given a particular sequence of 8 hex digits, what is the chance of another SHA-1 hash appearing with the same 8 digits?’ It's a very small number.

    What's at stake in this question, though, is a different question: ‘given a large number of 8-hex-digit sequences, what's the chance of any two of them being the same?’ As the first comment to the question points out, this is related to the birthday paradox, which is not ‘what's the chance of someone in the room having the same birthday as me?’, but instead ‘what's the chance of any two people in this room having the same birthday?’ As is reasonably well known, the chance of that is 50% with only 23 people.

    The hash-collision problem is essentially the same problem, but generalised from N=365 days to N=16^8 8-byte sequences, which is about 4.30e9. That's the ‘generalised birthday problem’. Using the expression quoted there (n=sqrt(2*d*ln(1/(1-p))), with d=4.30e9 and p=0.5, we find a 50% chance of a collision with only 77000 trials. If you plot the corresponding function, you see the probability increases quite rapidly as the number of trials increases.

    Even with 16 bytes of hash (so d=16^16) there's a 50% chance of a collision after only 5 billion trials.

    Happy birthday!