Search code examples
hashmd5parallel-processingsha1checksum

What hash algorithms are parallelizable? Optimizing the hashing of large files utilizing on multi-core CPUs


I'm interested in optimizing the hashing of some large files (optimizing wall clock time). The I/O has been optimized well enough already and the I/O device (local SSD) is only tapped at about 25% of capacity, while one of the CPU cores is completely maxed-out.

I have more cores available, and in the future will likely have even more cores. So far I've only been able to tap into more cores if I happen to need multiple hashes of the same file, say an MD5 AND a SHA256 at the same time. I can use the same I/O stream to feed two or more hash algorithms, and I get the faster algorithms done for free (as far as wall clock time). As I understand most hash algorithms, each new bit changes the entire result, and it is inherently challenging/impossible to do in parallel.

Are any of the mainstream hash algorithms parallelizable?
Are there any non-mainstream hashes that are parallelizable (and that have at least a sample implementation available)?

As future CPUs will trend toward more cores and a leveling off in clock speed, is there any way to improve the performance of file hashing? (other than liquid nitrogen cooled overclocking?) or is it inherently non-parallelizable?


Solution

  • There is actually a lot of research going on in this area. The US National Institute of Standards and Technology is currently holding a competition to design the next-generation of government-grade hash function. Most of the proposals for that are parallelizable.

    One example: http://www.schneier.com/skein1.2.pdf

    Wikipedia's description of current status of the contest: http://en.wikipedia.org/wiki/SHA-3