Tags: mysql, hash, hashcode

Which hash algorithm should I use to check for duplicate files?


I have a WCF service that receives XML files (as a string parameter) for processing. I now want to add error logging: when an exception occurs, I'd like to log it along with the XML file that caused it.

I've created a MySQL database for this, and the files will be stored in a LONGBLOB column.

My question is how to avoid duplicates in the table that stores the files, since a user can submit the same file repeatedly. To save storage space, I'd like to detect that exactly the same file has already been saved and, in that case, just reuse the reference.

Which method is best for that? My first thought was to generate a hash code and save it in another column of the table, so I could search on it later. While researching this, I found that several algorithms are available for computing the hash:

System.Security.Cryptography.KeyedHashAlgorithm
System.Security.Cryptography.MD5
System.Security.Cryptography.RIPEMD160
System.Security.Cryptography.SHA1
System.Security.Cryptography.SHA256
System.Security.Cryptography.SHA384
System.Security.Cryptography.SHA512

Which one is better? Is it safe to use one of them to determine whether a file is a duplicate? What is the difference between using these methods and the .GetHashCode() function?


Solution

  • All hashes intrinsically have collisions, so you cannot rely on a hash alone to identify a file. (If you try, your system will appear to work fine for a while, how long depending on random chance and the size of the hash, before failing catastrophically when it decides that two completely different files are the same.)

    Hashes may still be useful as the first step in a mechanism where the hash locates a "bucket" that can contain 0..n files, and you determine actual uniqueness by comparing the full file contents.

    Since this is an application where a fast hashing algorithm is an advantage, I'd use MD5.
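
The bucket approach described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `FileStore`, `ComputeHash`, and `GetOrAdd` are hypothetical names, and the in-memory dictionary only stands in for the MySQL table. In a real schema the hash would presumably live in an indexed, non-unique column next to the blob. The sketch also assumes the XML string is encoded as UTF-8 before hashing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical in-memory stand-in for the MySQL table: each hash value
// identifies a "bucket" that may hold zero or more stored files.
public sealed class FileStore
{
    // hash (hex string) -> rows that share that hash
    private readonly Dictionary<string, List<(int Id, byte[] Content)>> _buckets
        = new Dictionary<string, List<(int Id, byte[] Content)>>();
    private int _nextId = 1;

    // MD5 of the raw bytes, rendered as 32 lowercase hex characters.
    public static string ComputeHash(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Returns the id of an already stored identical file, or stores a new one.
    public int GetOrAdd(string xml)
    {
        // Assumption: the XML string is hashed and stored as UTF-8 bytes.
        byte[] content = Encoding.UTF8.GetBytes(xml);
        string hash = ComputeHash(content);

        if (!_buckets.TryGetValue(hash, out var bucket))
        {
            bucket = new List<(int Id, byte[] Content)>();
            _buckets[hash] = bucket;
        }

        // An equal hash is only a hint; confirm with a full byte comparison.
        foreach (var row in bucket)
        {
            if (row.Content.SequenceEqual(content))
            {
                return row.Id;
            }
        }

        int id = _nextId++;
        bucket.Add((id, content));
        return id;
    }
}
```

With this shape, submitting the same XML twice returns the same id, while two different files that happened to share a hash would simply end up as two rows in the same bucket.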