c# amazon-web-services amazon-s3 aws-sdk aws-sdk-net

How can I check that S3 has validated the checksums for my object when using the AWS SDK for .NET?


I'm uploading a file using PutObject, which works, but how do I tell whether the MD5 checksum has been verified?

var s3Client = new AmazonS3Client();

string base64Checksum;
using (var md5 = MD5.Create())
{
    byte[] fileBytes = File.ReadAllBytes(filePath);
    byte[] hash = md5.ComputeHash(fileBytes, 0, fileBytes.Length);
    base64Checksum = Convert.ToBase64String(hash);
}

var putRequest = new PutObjectRequest()
{
    BucketName = bucketName,
    Key = objectKey,
    FilePath = filePath,
    ContentType = "text/plain",
    MD5Digest = base64Checksum 
};

var putResponse = await s3Client.PutObjectAsync(putRequest);

In the response, putResponse.ResponseMetadata.ChecksumAlgorithm is set to NONE and ChecksumValidationStatus is NOT_VALIDATED.

Does this mean the MD5 hash I've provided has not been validated?

And alternatively, if I set ChecksumAlgorithm to ChecksumAlgorithm.SHA256:

var putRequest = new PutObjectRequest()
{
    // ...
    ChecksumAlgorithm = ChecksumAlgorithm.SHA256
};

The checksum is calculated by AWS, but ChecksumAlgorithm and ChecksumValidationStatus still remain as above.

And even if I calculate it myself and set it:

var putRequest = new PutObjectRequest()
{
    // ...
    ChecksumSHA256 = sha256Checksum
};

I still get ChecksumAlgorithm set to NONE and ChecksumValidationStatus set to NOT_VALIDATED.

What am I doing wrong?


Solution

  • Does this mean the MD5 hash I've provided has not been validated?

    No, it has been validated.

    MD5 checksum verification is done automatically by Amazon S3, based on an MD5 checksum sent via the Content-MD5 header.

    This value can be generated by the SDK or provided as part of the PutObject request. Either way, the key point is that the verification is done by AWS, regardless of who provides the MD5 digest, as stated in the docs:

    When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, Amazon S3 returns an error.

    If the MD5Digest you provide (which maps to the Content-MD5 header) is correct, the PutObject request succeeds without any exception, signalling that the MD5 digest has been verified successfully.

    S3 now guarantees that what you think you've uploaded is actually what has been uploaded. The MD5 of your local object matches the MD5 calculated by S3 - great.

    If the MD5Digest is incorrect (or the upload has been corrupted in transit), the .NET SDK throws this exception, with an error code of BadDigest:

    Amazon.S3.AmazonS3Exception: The Content-MD5 you specified did not match what we received.

    This is how the failure looks in .NET, but note that the SDK is merely surfacing the S3 API's 400 Bad Request error.
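
    Handling that failure could look like the following. This is a minimal sketch, assuming the AWSSDK.S3 v3 package; the class and method names (DigestCheckedUpload, TryPutAsync) are hypothetical and not part of the SDK.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using Amazon.S3;
    using Amazon.S3.Model;

    // Hypothetical helper, not part of the SDK.
    public static class DigestCheckedUpload
    {
        // Returns true if S3 accepted the upload, false if S3 rejected it
        // because the digest it calculated did not match the one in the request.
        public static async Task<bool> TryPutAsync(IAmazonS3 s3Client, PutObjectRequest putRequest)
        {
            try
            {
                await s3Client.PutObjectAsync(putRequest);
                return true; // no exception: S3 accepted the object and any digest sent with it
            }
            catch (AmazonS3Exception ex) when (ex.ErrorCode == "BadDigest")
            {
                // The SDK surfaces S3's 400 Bad Request as an AmazonS3Exception.
                Console.WriteLine($"S3 rejected the upload ({(int)ex.StatusCode}): {ex.Message}");
                return false;
            }
        }
    }
    ```

    Any other AmazonS3Exception (for example an access or bucket error) is deliberately left to propagate, since it says nothing about the checksum.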

    By default, the SDK generates an MD5 digest for you. It does not do so if either of the following conditions is true:

    • DisableMD5Stream is set to true
    • a value has been set for MD5Digest (the condition that is true for your code)

    You don't have to provide your own value.


    And alternatively, if I set ChecksumAlgorithm to ChecksumAlgorithm.SHA256, the checksum is calculated by AWS

    If you set the ChecksumAlgorithm to the desired algorithm, the AWS SDK will calculate the checksum. This calculated checksum is then included in the request sent to Amazon S3.

    However, this checksum should not be confused with the checksum that Amazon S3 calculates after receiving the request. Amazon S3 compares its own checksum against the one sent in the request.


    MD5 checksum verification aside, S3 announced support for 4 additional checksum algorithms in Feb 2022, which can be used alongside the MD5 integrity check:

    • CRC32: x-amz-checksum-crc32
    • CRC32C: x-amz-checksum-crc32c
    • SHA1: x-amz-checksum-sha1
    • SHA256: x-amz-checksum-sha256

    As above, these checksums can be calculated by the SDK or provided by the user. Once again, Amazon S3 checks the object against the provided checksum value and, if they do not match, returns an error.

    The failure looks the same as above: if your CRC32/CRC32C/SHA1/SHA256 checksum value is incorrect, you'll get an exception with an error code of BadDigest and a message naming whichever checksum algorithm you used.

    Amazon.S3.AmazonS3Exception: The SHA256 you specified did not match the calculated checksum.

    None of this verification is performed by the SDK.

    The SDK does one of three things: it generates a checksum and sends it under the right header name, it sends the checksum value you've manually provided under the right header name, or it doesn't send the header at all (for no additional checksum verification).
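
    Those three request shapes can be sketched as follows. This is an illustration only, assuming the AWSSDK.S3 v3 package used in the question; ChecksumOptions and its method names are hypothetical.

    ```csharp
    using System;
    using System.IO;
    using System.Security.Cryptography;
    using Amazon.S3;
    using Amazon.S3.Model;

    // Hypothetical helper class, not part of the SDK.
    public static class ChecksumOptions
    {
        // Base64-encoded SHA-256 of a file: the value format carried by x-amz-checksum-sha256.
        public static string ComputeSha256Base64(string filePath)
        {
            using var sha256 = SHA256.Create();
            using var stream = File.OpenRead(filePath);
            return Convert.ToBase64String(sha256.ComputeHash(stream));
        }

        // 1. The SDK calculates the SHA-256 checksum and sends it with the request.
        public static PutObjectRequest SdkCalculated(string bucket, string key, string filePath) =>
            new PutObjectRequest
            {
                BucketName = bucket,
                Key = key,
                FilePath = filePath,
                ChecksumAlgorithm = ChecksumAlgorithm.SHA256
            };

        // 2. The caller calculates the checksum; the SDK only forwards it.
        public static PutObjectRequest CallerProvided(string bucket, string key, string filePath) =>
            new PutObjectRequest
            {
                BucketName = bucket,
                Key = key,
                FilePath = filePath,
                ChecksumSHA256 = ComputeSha256Base64(filePath)
            };

        // 3. No additional checksum is configured here
        //    (what the SDK sends by default may vary by SDK version).
        public static PutObjectRequest NoAdditionalChecksum(string bucket, string key, string filePath) =>
            new PutObjectRequest
            {
                BucketName = bucket,
                Key = key,
                FilePath = filePath
            };
    }
    ```

    In the first two cases it is still S3 that compares the value against what it received; the only difference is who computed the value that was sent.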


    So what is ChecksumValidationStatus?

    If you take a look at the S3 API's response object (which all of the SDKs are essentially clients for), it's not actually there. It's a .NET-SDK-specific concept relating to the additional checksum algorithms and is not related to MD5 checksum validation at all.

    The field is also not related to Amazon S3's verification of the checksum value. That is indicated by the S3 response and, in the .NET SDK's case, no exception = ✅ valid & verified checksum.

    So, let's say we upload object payroll.txt with a (dummy) SHA256 checksum value of a. No exceptions are thrown, so we know S3 has validated that our object was not corrupted in transit, because its calculated checksum value for payroll.txt is also a. We are now confident that S3 is storing payroll.txt exactly as expected.

    On another device, we send a GetObjectRequest via the .NET SDK to download payroll.txt. We know that S3 is truly storing payroll.txt but how do we know that the .NET SDK has truly downloaded payroll.txt as intended?

    That's where ChecksumValidationStatus comes into play: it should be checked on the response to a GetObjectRequest. The fact that it is even accessible on the PutObjectResponse seems like a leaky abstraction to me.

    This is why, even if we've specified an additional SHA256 checksum value to validate, the PutObjectResponse always has a status of NOT_VALIDATED. The SDK client doesn't validate the checksum of the object on a PutObject at all, so a status has no meaning there.

    With that out of the way: the field exists so that the SDK can validate that it has downloaded the right object and that it wasn't corrupted on the way. As long as you've set ChecksumMode on the request to ChecksumMode.ENABLED, the SDK will obtain and populate the checksum fields, e.g. GetObjectResponse.ChecksumSHA256. ChecksumMode maps to the x-amz-checksum-mode header.

    Of course, you can then manually verify this, but the SDK tries to help by aiming to change ChecksumValidationStatus from PENDING_RESPONSE_READ (its initial value after the GET) to either SUCCESSFUL or INVALID, based on the hash it generates.

    It can only generate the hash of the downloaded object once it has been fully read i.e. on the closure of the ResponseStream (a standard .NET Stream).

    You can see this within the comment for the ChecksumValidationStatus enum based on the public source code:

    /// States for response checksum validation 
    public enum ChecksumValidationStatus
    {
        /// Set when the SDK did not perform checksum validation.
        NOT_VALIDATED,
    
        /// Set when a checksum was selected to be validated, but validation
        /// will not completed until the response stream is fully read. At that point an exception
        /// will be thrown if the checksum is invalid.
        PENDING_RESPONSE_READ,
    
        /// The checksum has been validated successfully during response unmarshalling.
        SUCCESSFUL,
    
        /// The checksum of the response stream did not match the header sent by the service.
        INVALID
    }
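
    The manual verification mentioned above could be sketched like this. It assumes the AWSSDK.S3 v3 package, an object uploaded with a single PutObject and a SHA256 checksum (multipart uploads store a different, composite value), and an object small enough to buffer in memory; DownloadVerifier and its methods are hypothetical names.

    ```csharp
    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Threading.Tasks;
    using Amazon.S3;
    using Amazon.S3.Model;

    // Hypothetical helper class, not part of the SDK.
    public static class DownloadVerifier
    {
        // True if the base64-encoded SHA-256 of content equals expectedBase64.
        public static bool Sha256Matches(byte[] content, string expectedBase64)
        {
            using var sha256 = SHA256.Create();
            string actual = Convert.ToBase64String(sha256.ComputeHash(content));
            return string.Equals(actual, expectedBase64, StringComparison.Ordinal);
        }

        public static async Task<byte[]> DownloadAndVerifyAsync(IAmazonS3 s3Client, string bucket, string key)
        {
            var request = new GetObjectRequest
            {
                BucketName = bucket,
                Key = key,
                ChecksumMode = ChecksumMode.ENABLED // ask S3 to return the stored checksum
            };

            using var response = await s3Client.GetObjectAsync(request);
            using var buffer = new MemoryStream();
            await response.ResponseStream.CopyToAsync(buffer); // buffers the whole object
            byte[] content = buffer.ToArray();

            // ChecksumSHA256 is only populated if the object was stored with a SHA256 checksum.
            if (!string.IsNullOrEmpty(response.ChecksumSHA256)
                && !Sha256Matches(content, response.ChecksumSHA256))
            {
                throw new InvalidDataException($"SHA256 mismatch for {bucket}/{key}");
            }

            return content;
        }
    }
    ```

    This check is independent of ChecksumValidationStatus: it only compares the bytes that were read against the checksum value S3 returned.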
    

    What am I doing wrong?

    There seems to be a bug where the validation status on the GetObjectResponse never changes from PENDING_RESPONSE_READ to SUCCESSFUL (or even INVALID) once the stream has been fully read and closed.

    My sample code that demonstrates this:

    using Amazon.S3;
    using Amazon.S3.Model;
    
    var bucketName = "xyz";
    var filePath = $"{DateTimeOffset.Now.ToUnixTimeMilliseconds()}.txt";
    
    await File.WriteAllTextAsync(filePath, "my-test-content");
    
    var s3Client = new AmazonS3Client();
    
    var putObject = new PutObjectRequest()
    {
        BucketName = bucketName,
        Key = filePath,
        FilePath = filePath,
        ContentType = "text/plain",
        ChecksumAlgorithm = ChecksumAlgorithm.SHA256,
    };
    
    Console.WriteLine($"Uploading object with key: {filePath}");
    Console.WriteLine("---");
    await s3Client.PutObjectAsync(putObject);
    
    var getObject = new GetObjectRequest()
    {
        BucketName = bucketName,
        Key = filePath,
        ChecksumMode = ChecksumMode.ENABLED,
    };
    
    Console.WriteLine($"Getting object with key: {filePath}");
    Console.WriteLine("---");
    var getResponse = await s3Client.GetObjectAsync(getObject);
    
    Console.WriteLine($"GET response SHA256: {getResponse.ChecksumSHA256}");
    Console.WriteLine($"Response stream CanRead status (not closed): {getResponse.ResponseStream.CanRead}");
    Console.WriteLine($"GET response checksum validation status: {getResponse.ResponseMetadata.ChecksumValidationStatus}");
    Console.WriteLine("---");
    
    
    Console.WriteLine($"Reading stream...");
    using (var reader = new StreamReader(getResponse.ResponseStream))
    {
        var content = await reader.ReadToEndAsync();
        Console.WriteLine($"Stream contents: {content}");
    }
    
    Console.WriteLine("---");
    
    Console.WriteLine($"Response stream CanRead status (not closed): {getResponse.ResponseStream.CanRead}");
    Console.WriteLine($"GET response checksum validation status: {getResponse.ResponseMetadata.ChecksumValidationStatus}");
    

    Output:

    Uploading object with key: 1702578364813.txt
    ---
    Getting object with key: 1702578364813.txt
    ---
    GET response SHA256: q7IK7CFDRfD5yHQ4kFLUm6PaH1qQVdUvT+1jR3NAw/4=
    Response stream CanRead status (not closed): True
    GET response checksum validation status: PENDING_RESPONSE_READ
    ---
    Reading stream...
    Stream contents: my-test-content
    ---
    Response stream CanRead status (not closed): False
    GET response checksum validation status: PENDING_RESPONSE_READ
    

    The second status line:

    GET response checksum validation status: PENDING_RESPONSE_READ

    should be:

    GET response checksum validation status: SUCCESSFUL

    You should rely on an AmazonClientException instead.

    I reached out to the AWS SDK for .NET team for confirmation that this field is currently unused:

    Generally for operations in the SDK that work with streams, we strive to avoid buffering the stream into memory and/or doing multiple passes on the stream and rewinding. So for response checksums, we calculate the client-side checksum as the user is reading the stream.

    That happens from the user's code, after we've passed through the UnmarshallerContext. So you're correct that it's a leaky abstraction and we never update ChecksumValidationStatus for S3's GetObject after the stream is read.

    We would for an hypothetical operation that returns a string instead of a stream, since we would do an initial "pass" to calculate the checksum during the SDK's internal unmarshalling before returning the response object to the user.

    What does happen for streaming operations is that, when we've finished reading the stream, we would throw an AmazonClientException if the hash we calculated while you were reading the stream doesn't match what the service said to expect.

    try 
    { 
        var content = await reader.ReadToEndAsync();
    }
    catch(AmazonClientException ex)
    {
       // here one could handle a checksum mismatch
    }
    

    Currently it's unused, as S3's GetObject is the only operation that uses this feature in the SDK. I agree that's leaky, it might be worth us taking another look at clarifying the ChecksumValidationStatus documentation and/or updating the status after the stream read if possible.

    We try to design new SDK features and modeling traits to allow other services to adopt them in the future. If another operation adopted it with a non-streaming response, it should begin working.


    In conclusion:

    • Amazon S3 can optionally validate the MD5 checksum for objects and/or one extra checksum value to ensure the integrity of the uploaded object

    • The extra checksum value can use one of the following algorithms: CRC32, CRC32C, SHA1 or SHA256

    • The checksum values can be provided by the user, or generated by the SDK depending on SDK configuration

    • No errors returned by the S3 API on the PutObject request means that S3 has verified the checksum of the object successfully

    • The SDK implementation may offer the option of validating the checksum value that is returned by the API, as long as the SDK has been configured to obtain the checksum value from S3 by setting x-amz-checksum-mode to ENABLED

    • The .NET SDK's ChecksumValidationStatus field currently shouldn't be relied upon for GetObject, so catch AmazonClientException instead