
Why store hash of decompressed data?


In the LZAV compression library API, a comment on the decompress function advises storing a hash of the original (uncompressed) data, so that after decompressing you can verify that the decompression was actually successful:

Note that while the function does perform checks to avoid OOB memory accesses, and checks for decompressed data length equality, this is not a strict guarantee of a valid decompression. In cases when the data is stored in a long-term storage without embedded data integrity mechanisms (e.g., a database without RAID 1 guarantee, a binary container without a digital signature nor CRC), then a checksum (hash) of original uncompressed data should be stored, and then evaluated against that of the decompressed data. Also, a separate checksum (hash) of application-defined header, which contains uncompressed and compressed data lengths, should be checked before decompression. A high-performance "komihash" hash function can be used to obtain a hash value of the data.

Would I be able to forgo hashing the uncompressed data if I instead verify the integrity of the bytes where I store the compressed data? For example, I compress some data and then save it to storage along with other data. If I check the validity of that stored data, I ensure that the input I pass to the decompressor is byte-for-byte the same as what came out of the compressor. If the input given to the decompressor is identical to the output produced by the compressor, then it's guaranteed to decompress correctly, right?
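The scheme described above can be sketched as follows. This is an illustrative example only: `zlib` stands in for LZAV (the same idea applies to any deterministic compressor), and SHA-256 stands in for whatever checksum your storage layer uses. Only the compressed bytes are hashed here, which is exactly the question's proposal:

```python
import hashlib
import zlib


def pack(data: bytes) -> tuple[bytes, bytes]:
    """Compress and return (compressed bytes, digest of the compressed bytes)."""
    compressed = zlib.compress(data)
    return compressed, hashlib.sha256(compressed).digest()


def unpack(compressed: bytes, digest: bytes) -> bytes:
    """Verify the stored compressed bytes are intact, then decompress."""
    if hashlib.sha256(compressed).digest() != digest:
        raise ValueError("stored compressed data is corrupted")
    # At this point the decompressor input is byte-identical to the
    # compressor output, so any failure below would indicate a codec bug.
    return zlib.decompress(compressed)
```

The digest check runs before decompression, so corrupted storage is caught without ever feeding damaged input to the decompressor.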


Solution

  • Yes: compression and decompression are deterministic, so if the decompressor receives exactly the bytes the compressor produced, it is guaranteed to reproduce the original input, provided there are no bugs whatsoever in the compression code or the decompression code.

    I have trust issues, so I would prefer to retain the integrity check.
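For the belt-and-braces approach the library comment recommends, a hedged sketch looks like this. Again `zlib` stands in for LZAV, and SHA-256 stands in for the suggested komihash (which is not in the Python standard library). Both the compressed and the uncompressed digests are stored, along with the lengths the comment says belong in the application header:

```python
import hashlib
import zlib


def store(data: bytes) -> dict:
    """Compress and record digests of both the compressed and the
    original bytes, plus both lengths, per the LZAV comment's advice."""
    compressed = zlib.compress(data)
    return {
        "compressed": compressed,
        "compressed_len": len(compressed),
        "original_len": len(data),
        "compressed_digest": hashlib.sha256(compressed).digest(),
        "original_digest": hashlib.sha256(data).digest(),
    }


def load(record: dict) -> bytes:
    """Verify the stored bytes, decompress, then verify the result."""
    c = record["compressed"]
    # Check storage integrity before decompression (catches bit rot).
    if hashlib.sha256(c).digest() != record["compressed_digest"]:
        raise ValueError("compressed data corrupted in storage")
    data = zlib.decompress(c)
    # Check the decompressed result (catches codec bugs as well).
    if len(data) != record["original_len"]:
        raise ValueError("decompressed length mismatch")
    if hashlib.sha256(data).digest() != record["original_digest"]:
        raise ValueError("decompressed data does not match original")
    return data
```

The second pair of checks is redundant if the codec is bug-free, which is precisely why the answer above retains it: it costs one extra hash and protects against the case the storage checksum cannot, a faulty compressor or decompressor.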