Search code examples
algorithmhashchecksum

upgradable digest / checksum algorithm


I would like to create a database containing checksums of a lot of files and I fear for checksum-collissions (two different files with the same checksum).

Question 1: what is the probability two different files will have the same MD5 sum?

As a workaround I thought about using an increasing checksum. Start with a small checksum and, in case of a collision calculate a larger checksum which can be derived to the smaller checksum so I don't have to recalculate the checksums of all my files already in the database... I still want to be able to search for a smaller sized checksums.

Question 2: which checksum / digest algorithm could do this trick? I need a checksum algorithm which can calculate a value of a certain size and "backwards" compatible (of a smaller size). Ie. file1 has a 2 byte checksum of 0x1234 and a 4 byte checksum of 0x12345678, the 2 byte checksum can be derived from the 4 byte checksum.


Solution

  • Question 1: depends how many files you have. For each pair, it is approximately 1 in 2^128. If you have 2^64 files (which I reckon you probably don't), the probability of at least one collision among them is about 0.5.

    This assumes no malice on the part of whoever produces the files. There are known MD5 collisions, and known ways to generate files that collide. If someone can make money at your expense by exposing you to collisions, then the probability of a collision is close to 1 :-)

    Question 2: normally you'd just use a better hash to begin with (perhaps SHA-256) and then your "small" hash is either the first few bytes of the large one, or the first one taken modulo some large number, perhaps a prime. But it depends what you want it for.

    A cheap and cheerful option would be for the "large" hash to be two or more "small" hashes concatenated together - hash the file forwards and backwards, for example. Of course once the small hash is broken, there's no telling whether or not that break will lead to a break of the combination of two+ hashes.