
How do I reproduce checksum of gzip files copied with s3DistCp (from Google Cloud Storage to AWS S3)


I copied a large number of gzip files from Google Cloud Storage to AWS S3 using s3DistCp (as this AWS article describes). When I compare the files' checksums, they differ (md5, sha-1, and sha-256 all show the same issue).

If I compare the sizes (in bytes) or the decompressed contents of a few files (via diff or another checksum), they match. (Here I'm comparing files pulled down directly from Google via gsutil against my distcp'd files pulled down from S3.)

Using file, I do see a difference between the two:

file1-gs-direct.gz: gzip compressed data, original size modulo 2^32 91571
file1-via-s3.gz:    gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 91571

My Goal/Question:

My goal is to verify that my downloaded files match the original files' checksums, but I don't want to have to re-download the files or analyze them directly on Google. Is there something I can do to my S3-stored files to reproduce the original checksums?

Things I've tried:

Re-gzipping with different compression levels: While I wouldn't expect s3DistCp to change the original file's compression, here's my attempt at recompressing:

target_sha=$(shasum -a 1 file1-gs-direct.gz | awk '{print $1}')
for i in {1..9}; do
  cur_sha=$(gunzip -c file1-via-s3.gz | gzip -n -$i | shasum -a 1 | awk '{print $1}')
  echo "$i. $target_sha == $cur_sha ? $([[ $target_sha == $cur_sha ]] && echo 'Yes' || echo 'No')"
done

1. abcd...1234 == dcba...4321 ? No
2. ... ? No
...
9. ... ? No

Solution

  • While typing out my question, I figured out the answer:

    S3DistCp is apparently rewriting the "OS" byte in the gzip header, which explains the "FAT filesystem" label I'm seeing with file. (Note: to rule out S3 itself causing the issue, I copied my "file1-gs-direct.gz" up to S3, and after pulling it back down, the checksum remains the same.)

    Here's the diff between the two files:

    $ diff <(cat file1-gs-direct.gz | hexdump -C) <(cat file1-via-s3.gz | hexdump -C)
    1c1
    < 00000000  1f 8b 08 00 00 00 00 00  00 ff ed 7d 59 73 db 4a  |...........}Ys.J|
    ---
    > 00000000  1f 8b 08 00 00 00 00 00  00 00 ed 7d 59 73 db 4a  |...........}Ys.J|
    

    It turns out the 10th byte of a gzip file "identifies the type of file system on which compression took place" (gzip RFC, RFC 1952):

        +---+---+---+---+---+---+---+---+---+---+
        |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
        +---+---+---+---+---+---+---+---+---+---+
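That OS field sits at byte offset 9 (zero-based), so it can be checked without a full hexdump. A small helper (the name is mine) that prints it as two hex digits:

```shell
# gzip_os_byte: print the OS header field (offset 9, zero-based) of a
# gzip file as two hex digits, e.g. 00 = FAT, 03 = Unix, ff = unknown.
gzip_os_byte() {
  dd if="$1" bs=1 skip=9 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n'
}
```

Running it on the two files above would print `ff` for the gs-direct copy and `00` for the via-s3 copy.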
    

    Using hexedit, I'm able to change my "via-s3" file's OS byte from 00 to FF, and then the checksums match.
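For many files, the same one-byte edit can be scripted rather than done by hand in hexedit. A sketch (helper name mine) that assumes, as in my case, the OS byte is the only difference:

```shell
# patch_gzip_os_byte: copy a gzip file and overwrite its OS header byte
# (offset 9) with 0xFF ("unknown"), matching the GCS-side originals.
patch_gzip_os_byte() {
  cp "$1" "$2"
  # conv=notrunc makes dd write the single byte in place without
  # truncating the rest of the file; \377 is octal for 0xFF.
  printf '\377' | dd of="$2" bs=1 seek=9 count=1 conv=notrunc 2>/dev/null
}
```

Usage would be `patch_gzip_os_byte file1-via-s3.gz file1-patched.gz`, after which `shasum -a 1 file1-gs-direct.gz file1-patched.gz` should show matching hashes.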

    Caveat: Editing a file that is later decompressed may cause unexpected issues, so use this with caution. (In my case, I'm only comparing file checksums, so worst case a file shows as mismatching even though the uncompressed contents are the same.)