amazon-web-services amazon-s3 etag

AWS S3 Etag for multipart upload with "-1"


I have some files on AWS S3, and I run a script that validates the ETags in the S3 inventory against the files on my local machines.

Generally, everything works fine and as expected. However, some files, for some unknown reason, are listed in the inventory as:

2613248,969a0e282c595da98f4aad489b90580e-1,TRUE

So the file size is 2613248, the ETag is 969a0e282c595da98f4aad489b90580e-1, and TRUE means it is a multipart upload. However, the -1 suffix means there is only one part, so why is it multipart? My local ETag calculator works for all files except those with -1. How are the ETags calculated for these? I have tried many things but cannot figure it out, and I can't find any AWS documentation about it.

I tried calculating the ETags in several different ways, but nothing matches.


Solution

  • You're seeing the result of a multipart upload with one part. S3 allows this, even when a multipart upload isn't necessary and a regular (non-multipart) upload would have worked. Different SDKs end up following this pattern in different situations; notably, boto3 will fall back to this behavior if an object is exactly the size of the multipart threshold.

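    The formula to calculate an ETag is the same for a multipart object with one part or with many parts: take the MD5 digest of each part, concatenate those binary digests in part order, take the MD5 of that concatenation, and append a hyphen and the number of parts. With one part, that means the ETag is the MD5 of the MD5 digest of the single part, followed by -1, which is why a plain MD5 of the file does not match.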

    As an example, here's a one-part multipart upload in Python, followed by a manual calculation that produces the same ETag:

    import boto3
    from hashlib import md5
    
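    # Sample data to upload to S3 for this example: 2,613,248 zero bytes
    body = b'\x00' * 2613248
    
    # Placeholder bucket and key for this example; replace with your own values
    example_bucket = 'example-bucket'
    example_key = 'example-key'
    
    # Manually create a multipart upload of one part
    # Do this manually so we can force the multipart upload and ensure one part is used
    s3 = boto3.client('s3')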
    
    resp = s3.create_multipart_upload(Bucket=example_bucket, Key=example_key)
    upload_id = resp['UploadId']
    resp = s3.upload_part(
        Bucket=example_bucket, Key=example_key, 
        PartNumber=1, UploadId=upload_id, Body=body,
    )
    part_etag = resp['ETag']
    resp = s3.complete_multipart_upload(
        Bucket=example_bucket, Key=example_key, 
        UploadId=upload_id, 
        MultipartUpload={"Parts": [{"PartNumber": 1, "ETag": part_etag}]},
    )
    final_etag = resp['ETag']
    
    # Show the final ETag that S3 generated
    print(final_etag)
    # Outputs: "16308760353355915d657a6f32d6d6f1-1"
    
    # Manually calculate the ETag.
    # Note that this basic pattern is the same as a multipart ETag when
    # there is more than one part: calculate the hash of each part,
    # then hash the concatenation of all of those part hashes,
    # and append a hyphen and the number of parts to the resulting
    # hex string
    
    # The list of hashes of each part, in this case, a list of one item
    part_digests = [
        md5(body).digest(),
    ]
    num_parts = len(part_digests)
    # Join all of the parts together
    part_digests = b''.join(part_digests)
    # Calculate the hash of all of the hashes
    etag = md5(part_digests).hexdigest()
    # Quote the final ETag to match AWS's format, and add the number of parts
    etag = f'"{etag}-{num_parts}"'
    print(etag)
    # Outputs: "16308760353355915d657a6f32d6d6f1-1"
    
    # Proof that our formula matches AWS:
    print(etag == final_etag)
    # Outputs: True
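
For validating local files against an inventory like the one in the question, the same formula can be wrapped in a small helper. This is only a minimal sketch: the function name `s3_style_etag` and its `part_size` and `multipart` parameters are hypothetical, and it assumes you know (or can guess) the part size the uploader used. For a one-part object, any part size at least as large as the file gives the same result.

```python
from hashlib import md5


def s3_style_etag(data: bytes, part_size: int, multipart: bool) -> str:
    """Return an S3-style ETag (without surrounding quotes) for in-memory data.

    Hypothetical helper for illustration; not part of boto3 or any AWS SDK.
    """
    if not multipart:
        # Regular upload: the ETag is the plain MD5 of the content
        return md5(data).hexdigest()

    # Split the data into parts of at most part_size bytes.
    # Treating empty input as one empty part is an assumption of this sketch.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]

    # Concatenate the binary MD5 digest of each part, then hash the result
    digests = b"".join(md5(part).digest() for part in parts)
    return f"{md5(digests).hexdigest()}-{len(parts)}"


# Usage: same sample data as the example above, with an 8 MiB part size
body = b"\x00" * 2613248
print(s3_style_etag(body, 8 * 1024 * 1024, multipart=True))
print(s3_style_etag(body, 8 * 1024 * 1024, multipart=False))
```

The key design choice is to take the multipart flag from the inventory rather than inferring it from the file size, so a file that fits in a single part is still hashed with the multipart formula when the inventory marks it as a multipart upload.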