Tags: c, parallel-processing, checksum, data-integrity

Parallelizable hash function for data integrity check


I need to check data integrity in the following situation: the data is written to storage in chunks of varying sizes (for each chunk we know its offset in the final file). However, the chunks arrive in arbitrary order and on multiple threads. They are read back from storage in a completely different order, and the read chunks need not match the written chunks in size.

What I currently have in mind is the following:

    #include <stddef.h>

    #define MODEST_PRIME 1021

    /* XOR each byte of a chunk into the slot given by its absolute file
       position modulo a prime. `result` must point to a zero-initialized
       buffer of MODEST_PRIME bytes. */
    void check_function(const unsigned char *chunk, size_t offset,
                        size_t length, unsigned char *result)
    {
        size_t i;
        for (i = 0; i < length; i++)
            result[(i + offset) % MODEST_PRIME] ^= chunk[i];
    }

This seems to protect against any single-byte change and, to some extent, against swapped chunks (it is unlikely that the distance between two swapped chunks is divisible by the prime). The per-chunk results can simply be XORed together, so the scheme is completely parallelizable.
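
For concreteness, here is a minimal sketch of the combine step (the `combine` helper is my own illustration, not part of the original scheme): each thread accumulates into its own zero-initialized MODEST_PRIME-byte buffer, and the buffers are XORed together at the end.

    /* XOR one thread's partial result into a shared accumulator.
       XOR is commutative and associative, so call order does not matter. */
    void combine(unsigned char *acc, const unsigned char *partial)
    {
        size_t i;
        for (i = 0; i < MODEST_PRIME; i++)
            acc[i] ^= partial[i];
    }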

However, this function looks highly unsophisticated compared to MD5 or any other modern hash function. And, as far as I understand, an MD5 or SHA-1 digest cannot be computed over the chunks in arbitrary order.

Well, the question is: is there any better solution that

  1. Can be computed in arbitrary order, given the size and offset of each chunk (like the simple algorithm outlined above).
  2. Can offer an integrity check at least comparable to that of an MD5 sum.

Solution

  • One option is a tree-like checksumming hierarchy.

    With two levels, the chunks sit at the 1st (bottom) level of the tree and each chunk gets its own checksum. The 2nd level is a byte array formed by concatenating those per-chunk checksums in chunk order; a single checksum over that array yields the final value (see the sketch below).

    This works with any hash function.
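
    As a minimal sketch of the two-level idea (all names here are illustrative; 64-bit FNV-1a is used only as a stand-in for whatever hash you prefer, and each chunk is assumed to carry a known index):

        #include <stddef.h>
        #include <stdint.h>

        /* 64-bit FNV-1a, a placeholder leaf hash; any hash function works. */
        static uint64_t fnv1a(const unsigned char *data, size_t len)
        {
            uint64_t h = 14695981039346656037ULL;
            size_t i;
            for (i = 0; i < len; i++) {
                h ^= data[i];
                h *= 1099511628211ULL;
            }
            return h;
        }

        /* Level 1: threads hash chunks in any order, each into its own slot. */
        void leaf_checksum(const unsigned char *chunk, size_t length,
                           size_t chunk_index, uint64_t *level2)
        {
            level2[chunk_index] = fnv1a(chunk, length);
        }

        /* Level 2: once all slots are filled, hash the concatenated leaf
           checksums (the byte order of the slots is fixed per platform). */
        uint64_t root_checksum(const uint64_t *level2, size_t nchunks)
        {
            return fnv1a((const unsigned char *)level2,
                         nchunks * sizeof *level2);
        }

    Only the final root_checksum pass is sequential, and it runs over a buffer that is tiny compared to the data itself, so the scheme stays effectively parallel.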