Tags: amazon-web-services, boto3, sha1

Does the boto3 Python SDK's complete_multipart_upload() calculate the SHA1 for each part?


To me, the docs are unclear.

If you've enabled additional checksum values for your multipart object, Amazon S3 calculates the checksum for each individual part by using the specified checksum algorithm. The checksum for the completed object is calculated in the same way that Amazon S3 calculates the MD5 digest for the multipart upload. You can use this checksum to verify the integrity of the object.

When I pass a calculated SHA1 as a parameter to the upload_part() method, in the response from this method, I receive an SHA1 back.

Question: Is the SHA1 in the response actually calculated by AWS? If so, what happens when it doesn't match the SHA1 I calculated and sent to AWS?

Function that uploads a chunk as one part of the multipart upload:

def upload_chunk_to_s3(s3_client, chunk, chunk_sha1, bucket, key, part_number, upload_id):
    # Upload a single part of an existing multipart upload, passing the
    # pre-computed SHA-1 so S3 can verify the part on arrival.
    resp = s3_client.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=upload_id,
        Body=chunk,
        ChecksumAlgorithm='SHA1',
        ChecksumSHA1=chunk_sha1,  # <-- this is calculated by me
    )
    return {'PartNumber': part_number, 'ETag': resp['ETag'], 'ChecksumSHA1': resp['ChecksumSHA1']}
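
For reference, chunk_sha1 is the base64-encoded SHA-1 digest of the chunk, which is the form the ChecksumSHA1 parameter expects; calculated roughly like this (sha1_b64 is just an illustrative helper name):

import base64
import hashlib

def sha1_b64(chunk: bytes) -> str:
    # Base64-encoded SHA-1 digest of the chunk, as expected by ChecksumSHA1
    return base64.b64encode(hashlib.sha1(chunk).digest()).decode('ascii')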

Response from AWS:

{
  "ResponseMetadata": {
    "RequestId": "<redact>",
    "HostId": "<redact>",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amz-id-2": "<redact>",
      "x-amz-request-id": "<redact>",
      "date": "Wed, 18 Oct 2023 20:39:50 GMT",
      "etag": ""<redact>"",
      "x-amz-checksum-sha1": "<redact>b+KCk=",
      "x-amz-server-side-encryption": "AES256",
      "server": "AmazonS3",
      "content-length": "0"
    },
    "RetryAttempts": 0
  },
  "ServerSideEncryption": "AES256",
  "ETag": ""<md5_here>"",
  "ChecksumSHA1": "<redact>b+KCk=" #<-- calculated by AWS for each part of the multipart_upload?
}

Edit 1:

@Quassnoi - Thank you for the insights; it's still murky to me on these points, though:

1.) I'm using multipart upload with 500 MiB chunks. My understanding is that ETag != MD5 for any chunk larger than 16 MB. So does that mean that the SHA1 I calculate and attach to each chunk of the multipart upload != the SHA1 returned by AWS?

2.) I need to verify the integrity of the file as a whole, as part of the multipart upload. Querying GetObjectAttributes sounds like it returns the SHA1 that I uploaded with each chunk!

If you query your object with GetObjectAttributes, it will return you the Checksum object with the field ChecksumSHA1 populated as follows: The base64-encoded, 160-bit SHA-1 digest of the object. This will only be present if it was uploaded with the object (Does this mean it returns MY SHA1, not the AWS-calculated SHA1?). With multipart uploads, this may not be a checksum value of the object. (Then how am I to compare the chunk SHA1 I calculated to the AWS-calculated SHA1 during upload?) For more information about how checksums are calculated with multipart uploads, see Checking object integrity in the Amazon S3 User Guide. (This doc leads to even more questions.)

It would be helpful to see a boto3 example of checking file integrity during multipart uploads. Do you know of any examples? I have been unable to find one.


Solution

  • Is the SHA1 in the response actually calculated by AWS?

    Yes, it is calculated by AWS.

    If so, what happens when it doesn't match the SHA1 I calculated and sent to AWS?

    S3 will reject the upload and return an error status to this effect.
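
    A rough sketch of what that rejection looks like from the boto3 side (the exact error code isn't asserted here, so inspect err.response in a real run; wrong_sha1 stands in for a digest that doesn't match the body):

    import botocore.exceptions

    try:
        s3_client.upload_part(
            Bucket=bucket,
            Key=key,
            PartNumber=part_number,
            UploadId=upload_id,
            Body=chunk,
            ChecksumAlgorithm='SHA1',
            ChecksumSHA1=wrong_sha1,  # hypothetical: a digest that does not match Body
        )
    except botocore.exceptions.ClientError as err:
        # S3 rejects the part; the error code and message describe the mismatch
        print(err.response['Error']['Code'], err.response['Error']['Message'])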


    To me, the docs are unclear.

    Here's the line in the excerpt you've quoted:

    The checksum for the completed object is calculated in the same way that Amazon S3 calculates the MD5 digest for the multipart upload.

    , which is referring to this part of the docs:

    When an object is uploaded as a multipart upload, the ETag for the object is not an MD5 digest of the entire object. Amazon S3 calculates the MD5 digest of each individual part as it is uploaded. The MD5 digests are used to determine the ETag for the final object. Amazon S3 concatenates the bytes for the MD5 digests together and then calculates the MD5 digest of these concatenated values. The final step for creating the ETag is when Amazon S3 adds a dash with the total number of parts to the end.
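
    In code, that ETag construction looks roughly like the following sketch (multipart_etag is an illustrative helper, not an official implementation):

    import hashlib

    def multipart_etag(parts: list[bytes]) -> str:
        # MD5 each part, concatenate the raw digests, MD5 that concatenation,
        # then append "-<number of parts>": the shape of a multipart ETag
        concatenated = b''.join(hashlib.md5(p).digest() for p in parts)
        return f'"{hashlib.md5(concatenated).hexdigest()}-{len(parts)}"'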

    If you query your object with GetObjectAttributes, it will return you the Checksum object with the field ChecksumSHA1 populated as follows:

    The base64-encoded, 160-bit SHA-1 digest of the object. This will only be present if it was uploaded with the object. With multipart uploads, this may not be a checksum value of the object. For more information about how checksums are calculated with multipart uploads, see Checking object integrity in the Amazon S3 User Guide.

    To verify the integrity of the object as a whole, you need to know the hashes of all the parts it was uploaded as.

    This also means that if you copy the object to another place as a whole (and not using multipart copy), its digests will change.

    Does this mean it returns MY SHA1 not the AWS-calculated SHA1?

    It's not a meaningful distinction. If your SHA1 didn't match what S3 had calculated, the upload would be rejected and nothing would be returned. If it returns something at all (basically as a courtesy), it matches what you provided in the header.

    Then how am I to compare chunk SHA1 I calculated to the AWS-calculated SHA during upload?

    To verify the object as a whole, you'll need to know the checksums of all of its parts. The formula is as follows:

    ChecksumSHA1(object) = SHA1(SHA1(parts[0]) || SHA1(parts[1]) || … || SHA1(parts[n])) || '-' || parts.length
    

    The individual checksums of the parts are also available in the output of GetObjectAttributes.
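
    Putting the formula and GetObjectAttributes together, a sketch of whole-object verification (s3_client, bucket and key as in the question; this assumes every part was uploaded with a SHA-1 checksum):

    import base64
    import hashlib

    # Fetch the per-part checksums and the stored checksum for the whole object.
    attrs = s3_client.get_object_attributes(
        Bucket=bucket,
        Key=key,
        ObjectAttributes=['Checksum', 'ObjectParts'],
        MaxParts=1000,  # paginate with PartNumberMarker if there are more parts
    )

    # Rebuild the composite value following the formula above: SHA-1 of the
    # concatenated raw part digests, base64-encoded, plus "-<number of parts>".
    part_digests = [base64.b64decode(p['ChecksumSHA1'])
                    for p in attrs['ObjectParts']['Parts']]
    composite = base64.b64encode(hashlib.sha1(b''.join(part_digests)).digest()).decode('ascii')
    expected = f"{composite}-{attrs['ObjectParts']['TotalPartsCount']}"

    # Compare the rebuilt value with what S3 reports for the object, and compare
    # each part's ChecksumSHA1 with the digests you calculated during upload.
    print(expected, attrs['Checksum']['ChecksumSHA1'])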

    The checksum flow is more usable for downloads, when you want to verify that what ended up on your drive (with all its bad blocks and cosmic ray bit flips) matches what AWS has in the object metadata.

    Double-checking beyond the check S3 already performs during the initial upload is a little bit on the paranoid side, but there is no law against downloading what you have just uploaded and verifying it once again.
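
    For that download-side check, a minimal sketch (assuming a single-part object, where the stored ChecksumSHA1 is a digest of the whole body; for multipart objects you would compare per part as above, and 'local_copy.bin' is a hypothetical path):

    import base64
    import hashlib

    # Ask S3 to include the stored checksum alongside the object metadata.
    head = s3_client.head_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')

    with open('local_copy.bin', 'rb') as f:
        local_sha1 = base64.b64encode(hashlib.sha1(f.read()).digest()).decode('ascii')

    print(local_sha1 == head.get('ChecksumSHA1'))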