amazon-s3, boto, boto3, s3cmd

Multipart upload to S3 with hash verification


I am looking for a command-line tool or a Python library that allows uploading big files to S3 with hash verification.

There is an AWS article explaining how this can be done automatically by supplying a Content-MD5 header.
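
For a plain single-part PUT, that header can be supplied directly. Here is a minimal sketch using boto3's put_object, which accepts a ContentMD5 parameter (the helper name, bucket, and key below are placeholders of my own):

    import base64
    import hashlib

    import boto3

    def put_with_md5(file_path, bucket, key):
        # S3 rejects the request if the MD5 of the body does not match
        # the Content-MD5 header, so a successful call implies the bytes
        # arrived intact.
        with open(file_path, 'rb') as fp:
            body = fp.read()
        content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('ascii')
        boto3.client('s3').put_object(
            Bucket=bucket, Key=key, Body=body, ContentMD5=content_md5)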

Yet it is not clear which command-line tools do or do not do this:

  • rclone's documentation states that

    files uploaded with multipart upload don’t have an MD5SUM.

  • s3cmd's documentation doesn't say anything about this, though it does support MD5 checks for the sync feature

  • s4cmd devotes a whole paragraph to this in its manual, but it is still not clear whether an upload is actually verified

  • the documentation for boto3 / s3transfer's upload_file() method doesn't really say anything about it

Do you have information about any of these tools, or about some other tool, Python library, or boto3 snippet that handles big-file uploads to S3 with the reliability of rsync?


Solution

  • After asking the authors of the official aws cli (boto3) tool, I can conclude that the aws cli always verifies every upload, including multipart ones.

    It does this chunk by chunk, using the same official MD5 ETag verification that applies to single-part uploads. On top of that, you can also enable SHA256 verification, again chunk by chunk.

    The aws cli does not, however, verify the whole assembled file. For that, you need to compute the expected multipart ETag yourself, for example with a small Python function like this:

    import hashlib

    def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
        """Compute the ETag that S3 assigns to an object uploaded in
        chunk_size parts (8 MB is the aws cli / boto3 default)."""
        md5s = []

        with open(file_path, 'rb') as fp:
            while True:
                data = fp.read(chunk_size)
                if not data:
                    break
                md5s.append(hashlib.md5(data))

        # Single-part upload: the ETag is just the MD5 of the file.
        if len(md5s) == 1:
            return '"{}"'.format(md5s[0].hexdigest())

        # Multipart upload: the ETag is the MD5 of the concatenated
        # per-part MD5 digests, followed by "-<number of parts>".
        digests = b''.join(m.digest() for m in md5s)
        digests_md5 = hashlib.md5(digests)
        return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))
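
    The returned value can then be compared against the ETag that S3 reports for the uploaded object, provided chunk_size matches the part size used for the upload. A minimal usage sketch with boto3's head_object, assuming configured credentials; the bucket, key, and file name are placeholders:

    import boto3

    s3 = boto3.client('s3')

    local_etag = calculate_s3_etag('big_file.bin')
    remote_etag = s3.head_object(Bucket='my-bucket', Key='big_file.bin')['ETag']

    # Both values include the surrounding double quotes, so they can be
    # compared directly.
    if local_etag != remote_etag:
        raise RuntimeError('ETag mismatch: the upload may be corrupted')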