I'm uploading a file using PutObject, which works, but how do I tell if the MD5 checksum has been verified?
var s3Client = new AmazonS3Client();
string base64Checksum;
using (var md5 = MD5.Create())
{
    byte[] fileBytes = File.ReadAllBytes(filePath);
    byte[] hash = md5.ComputeHash(fileBytes, 0, fileBytes.Length);
    base64Checksum = Convert.ToBase64String(hash);
}
var putRequest = new PutObjectRequest()
{
    BucketName = bucketName,
    Key = objectKey,
    FilePath = filePath,
    ContentType = "application/txt",
    MD5Digest = base64Checksum
};
await s3Client.PutObjectAsync(putRequest);
In the response, ResponseMetadata.ChecksumAlgorithm is set to NONE and ChecksumValidationStatus is NOT_VALIDATED.
Does this mean the MD5 hash I've provided has not been validated?
Alternatively, if I set ChecksumAlgorithm to ChecksumAlgorithm.SHA256:
var putRequest = new PutObjectRequest()
{
    // ...
    ChecksumAlgorithm = ChecksumAlgorithm.SHA256
};
The checksum is calculated by AWS, but ChecksumAlgorithm and ChecksumValidationStatus still remain as above.
And even if I calculate it myself and set it:
var putRequest = new PutObjectRequest()
{
    // ...
    ChecksumSHA256 = sha256Checksum
};
I still get ChecksumAlgorithm set to NONE and ChecksumValidationStatus set to NOT_VALIDATED.
What am I doing wrong?
Does this mean the MD5 hash I've provided has not been validated?
No, it has been validated.
MD5 checksum verification is done automatically by Amazon S3, based on an MD5 checksum that is sent via the Content-MD5 header.
This value can be generated by the SDK or provided as part of the PutObject request. The key point is that, regardless of who provides the MD5 digest, the verification is done by AWS, as stated in the docs:
When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, Amazon S3 returns an error.
If the MD5Digest provided (which maps to the Content-MD5 header) is correct, the PutObject request succeeds without any exceptions, signalling that the MD5 digest has been verified successfully.
S3 now guarantees that what you think you've uploaded is actually what has been uploaded. The MD5 of your local object matches the MD5 calculated by S3 - great.
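As a rough illustration of what these header values look like, here is a minimal sketch that hashes a stream and Base64-encodes the raw digest bytes (not the hex string). ToBase64Digest and the in-memory payload are hypothetical names made up for this example, not part of the AWS SDK; in the question's code the stream would presumably come from File.OpenRead(filePath), which also avoids loading the whole file into memory.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not an AWS SDK API): hashes a stream and returns the
// Base64 form of the raw digest bytes.
static string ToBase64Digest(HashAlgorithm algorithm, Stream content)
{
    byte[] digest = algorithm.ComputeHash(content); // reads the stream to its end
    return Convert.ToBase64String(digest);
}

// Illustrative payload; a real upload would hash the file being sent.
byte[] payload = Encoding.UTF8.GetBytes("my-test-content");

using (var md5 = MD5.Create())
using (var stream = new MemoryStream(payload))
{
    // Candidate value for PutObjectRequest.MD5Digest (Content-MD5 header).
    Console.WriteLine($"MD5:    {ToBase64Digest(md5, stream)}");
}

using (var sha256 = SHA256.Create())
using (var stream = new MemoryStream(payload))
{
    // Candidate value for PutObjectRequest.ChecksumSHA256 (x-amz-checksum-sha256 header).
    Console.WriteLine($"SHA256: {ToBase64Digest(sha256, stream)}");
}
```

If a value produced this way does not match what S3 computes on its side, the upload is rejected rather than stored.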
Now, if the MD5Digest is incorrect (or the upload has been corrupted), the .NET SDK will throw this exception with an error code of BadDigest:
Amazon.S3.AmazonS3Exception: The Content-MD5 you specified did not match what we received.
This is the .NET version of what happens, but note that the SDK is merely surfacing the S3 API's 400 Bad Request error.
By default, the SDK will generate an MD5 digest for you, so you don't have to provide your own value. It isn't calculated for you if either of the conditions below is true:
- DisableMD5Stream is set to true
- MD5Digest is set (the condition that is true for your code)
And alternatively, if I set ChecksumAlgorithm to ChecksumAlgorithm.SHA256, the checksum is calculated by AWS
If you set ChecksumAlgorithm to the desired algorithm, the AWS SDK will calculate the checksum. This calculated checksum is then included in the request sent to Amazon S3.
However, this checksum is not to be confused with the checksum that Amazon S3 generates after receiving the request. Amazon S3 compares its own checksum against the one that was sent in the request.
MD5 checksum verification aside, S3 announced support for 4 new checksum algorithms in Feb 2022 that can be used alongside the MD5 integrity check:
x-amz-checksum-crc32
x-amz-checksum-crc32c
x-amz-checksum-sha1
x-amz-checksum-sha256
As above, these checksums can also be calculated by the SDKs or provided by the user. Once again, Amazon S3 checks the object against the provided checksum value and, if they do not match, Amazon S3 returns an error.
Same format as above: if your CRC32/CRC32C/SHA1/SHA256 checksum value is incorrect, you'll get an exception with the error code of BadDigest and a message relating to whichever checksum algorithm you used:
Amazon.S3.AmazonS3Exception: The SHA256 you specified did not match the calculated checksum.
None of this verification is performed by the SDK.
The SDK does one of three things: it generates a checksum and sends it under the right header name, it sends the checksum value that you've manually provided under the right header name, or it doesn't send the header at all (for no additional checksum verification).
So what is ChecksumValidationStatus?
If you take a look at S3 API's response object - which all of the SDKs are basically clients for - it's not actually there. It's a .NET-SDK-specific concept regarding the additional checksum algorithms & is not related to MD5 checksum validation whatsoever.
The field is not related to Amazon S3's verification of the checksum value. That is denoted by the S3 response and in the .NET SDK's case: no exception = ✅ valid & verified checksum.
So, let's say we upload object payroll.txt with a (dummy) SHA256 checksum value of a. No exceptions are thrown, so we know that S3 has validated that our object was not corrupted in transit, as its calculated checksum value for payroll.txt is also a. We are now confident that S3 is truly storing payroll.txt as originally expected.
On another device, we send a GetObjectRequest via the .NET SDK to download payroll.txt. We know that S3 is truly storing payroll.txt, but how do we know that the .NET SDK has truly downloaded payroll.txt as intended?
That's where ChecksumValidationStatus comes into play, which should be checked on the GetObjectResponse. The fact that it is even accessible on the PutObjectResponse seems like a leaky abstraction to me.
This is why, even if we've specified an additional SHA256 checksum value to validate, the PutObjectResponse always has a status of NOT_VALIDATED. The SDK client doesn't validate the checksum of the object on a PutObject at all, so a status for it makes no sense there.
With that out of the way, the field is there so that the SDK can validate that it has downloaded the right object and that it hasn't been corrupted on the way. As long as you've set ChecksumMode on the request to ChecksumMode.ENABLED, the SDK will obtain and populate the checksum fields, e.g. GetObjectResponse.ChecksumSHA256. ChecksumMode maps to the x-amz-checksum-mode header.
Of course, you can then manually verify this, but the SDK tries to help by aiming to change the status of ChecksumValidationStatus from PENDING_RESPONSE_READ (its initial value post GET) to either SUCCESSFUL or INVALID based on the hash it generates.
It can only generate the hash of the downloaded object once it has been fully read, i.e. on the closure of the ResponseStream (a standard .NET Stream).
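For the manual route, a minimal sketch of hashing the content while it is being read could look like the following. CopyAndVerifySha256 is a hypothetical helper written for this example, not an SDK API, and the in-memory stream and locally computed checksum are stand-ins for getResponse.ResponseStream and getResponse.ChecksumSHA256.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not an SDK API): copies source to destination in chunks,
// hashing each chunk on the way, then compares the final SHA-256 digest with
// the Base64 value reported by the service.
static void CopyAndVerifySha256(Stream source, Stream destination, string expectedBase64)
{
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
    byte[] buffer = new byte[81920];
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        hasher.AppendData(buffer, 0, read);
        destination.Write(buffer, 0, read);
    }

    byte[] actual = hasher.GetHashAndReset();
    byte[] expected = Convert.FromBase64String(expectedBase64);
    if (!actual.AsSpan().SequenceEqual(expected))
    {
        throw new InvalidDataException("Downloaded content does not match the expected SHA-256 checksum.");
    }
}

// Stand-ins for getResponse.ResponseStream and getResponse.ChecksumSHA256.
byte[] body = Encoding.UTF8.GetBytes("my-test-content");
string reportedChecksum;
using (var sha256 = SHA256.Create())
{
    reportedChecksum = Convert.ToBase64String(sha256.ComputeHash(body));
}

using var download = new MemoryStream(body);
using var saved = new MemoryStream();
CopyAndVerifySha256(download, saved, reportedChecksum);
Console.WriteLine($"Verified {saved.Length} bytes");
```

Hashing chunk by chunk keeps memory use flat for large objects, at the cost of only learning about a mismatch after the last byte has been read.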
You can see this within the comment for the ChecksumValidationStatus enum based on the public source code:
/// States for response checksum validation
public enum ChecksumValidationStatus
{
    /// Set when the SDK did not perform checksum validation.
    NOT_VALIDATED,
    /// Set when a checksum was selected to be validated, but validation
    /// will not completed until the response stream is fully read. At that point an exception
    /// will be thrown if the checksum is invalid.
    PENDING_RESPONSE_READ,
    /// The checksum has been validated successfully during response unmarshalling.
    SUCCESSFUL,
    /// The checksum of the response stream did not match the header sent by the service.
    INVALID
}
What am I doing wrong?
There seems to be a bug where the validation status on the GetObjectResponse never changes from PENDING_RESPONSE_READ to SUCCESSFUL or even INVALID once the stream is fully read and closed.
My sample code that demonstrates this:
using Amazon.S3;
using Amazon.S3.Model;

var bucketName = "xyz";
var filePath = $"{DateTimeOffset.Now.ToUnixTimeMilliseconds()}.txt";
await File.WriteAllTextAsync(filePath, "my-test-content");
var s3Client = new AmazonS3Client();

// Upload with an SDK-calculated SHA256 checksum.
var putObject = new PutObjectRequest()
{
    BucketName = bucketName,
    Key = filePath,
    FilePath = filePath,
    ContentType = "application/txt",
    ChecksumAlgorithm = ChecksumAlgorithm.SHA256,
};
Console.WriteLine($"Uploading object with key: {filePath}");
Console.WriteLine("---");
await s3Client.PutObjectAsync(putObject);

// Download with checksum mode enabled so the checksum fields are populated.
var getObject = new GetObjectRequest()
{
    BucketName = bucketName,
    Key = filePath,
    ChecksumMode = ChecksumMode.ENABLED,
};
Console.WriteLine($"Getting object with key: {filePath}");
Console.WriteLine("---");
var getResponse = await s3Client.GetObjectAsync(getObject);
Console.WriteLine($"GET response SHA256: {getResponse.ChecksumSHA256}");
Console.WriteLine($"Response stream CanRead status (not closed): {getResponse.ResponseStream.CanRead}");
Console.WriteLine($"GET response checksum validation status: {getResponse.ResponseMetadata.ChecksumValidationStatus}");
Console.WriteLine("---");
Console.WriteLine($"Reading stream...");

// Disposing the reader also closes the underlying response stream.
using (var reader = new StreamReader(getResponse.ResponseStream))
{
    var content = await reader.ReadToEndAsync();
    Console.WriteLine($"Stream contents: {content}");
}
Console.WriteLine("---");
Console.WriteLine($"Response stream CanRead status (not closed): {getResponse.ResponseStream.CanRead}");
Console.WriteLine($"GET response checksum validation status: {getResponse.ResponseMetadata.ChecksumValidationStatus}");
Output:
Uploading object with key: 1702578364813.txt
---
Getting object with key: 1702578364813.txt
---
GET response SHA256: q7IK7CFDRfD5yHQ4kFLUm6PaH1qQVdUvT+1jR3NAw/4=
Response stream CanRead status (not closed): True
GET response checksum validation status: PENDING_RESPONSE_READ
---
Reading stream...
Stream contents: my-test-content
---
Response stream CanRead status (not closed): False
GET response checksum validation status: PENDING_RESPONSE_READ
The 2nd:
GET response checksum validation status: PENDING_RESPONSE_READ
should be:
GET response checksum validation status: SUCCESSFUL
You should depend on an AmazonClientException instead.
I reached out to the AWS SDK for .NET team for confirmation that this field is currently unused:
Generally for operations in the SDK that work with streams, we strive to avoid buffering the stream into memory and/or doing multiple passes on the stream and rewinding. So for response checksums, we calculate the client-side checksum as the user is reading the stream.
That happens from the user's code, after we've passed through the UnmarshallerContext. So you're correct that it's a leaky abstraction and we never update ChecksumValidationStatus for S3's GetObject after the stream is read.
We would for an hypothetical operation that returns a string instead of a stream, since we would do an initial "pass" to calculate the checksum during the SDK's internal unmarshalling before returning the response object to the user.
What does happen for streaming operations is when we've finished reading the stream
- Either the final 0 byte read
- Or during disposal
We would throw an AmazonClientException if the hash we calculated while you were reading the stream doesn't match what the service said to expect.
try
{
    var content = await reader.ReadToEndAsync();
}
catch (AmazonClientException ex)
{
    // here one could handle a checksum mismatch
}
Currently it's unused, as S3's GetObject is the only operation that uses this feature in the SDK. I agree that's leaky, it might be worth us taking another look at clarifying the ChecksumValidationStatus documentation and/or updating the status after the stream read if possible.
We try to design new SDK features and modeling traits to allow other services to adopt them in the future. If another operation adopted it with a non-streaming response, it should begin working.
In conclusion:
- Amazon S3 can optionally validate the MD5 checksum for objects and/or one extra checksum value to ensure integrity of the object uploaded
- The extra checksum values can be of the following algorithms: CRC32, CRC32C, SHA1 or SHA256
- The checksum values can be provided by the user, or generated by the SDK depending on SDK configuration
- No errors returned by the S3 API on the PutObject request means that S3 has verified the checksum of the object successfully
- The SDK implementation may offer the option of validating the checksum value that is returned by the API, as long as the SDK has been configured to obtain the checksum value from S3 by setting x-amz-checksum-mode to ENABLED
- The .NET SDK's ChecksumValidationStatus field currently shouldn't be relied on, so catch AmazonClientException instead