I am looking for a command line tool or a Python library which allows uploading big files to S3, with hash verification.
There is an AWS article explaining how this can be done automatically by supplying a Content-MD5 header.
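For a single put_object call, supplying that header with boto3 looks roughly like the sketch below (the bucket, key and file name are placeholders I made up; S3 fails the request if the uploaded bytes don't match the given MD5):

import base64
import hashlib

import boto3

s3 = boto3.client('s3')
data = open('small-file.bin', 'rb').read()

# S3 verifies the received bytes against this base64-encoded MD5
# and rejects the upload on a mismatch.
s3.put_object(
    Bucket='my-bucket', Key='small-file.bin', Body=data,
    ContentMD5=base64.b64encode(hashlib.md5(data).digest()).decode(),
)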
Yet, it is not clear which command line tools do or do not do this:
- rclone's documentation states that files uploaded with multipart upload don't have an MD5SUM.
- s3cmd doesn't say anything about this, but it supports MD5 for the sync feature.
- s4cmd has a whole paragraph in the manual, but it's still not clear whether an upload is actually verified.
- boto3 / s3transfer's upload_file() method doesn't really say anything about it.
Do you have information about any of these tools, or some other tool, Python library, or boto3 snippet that handles big file uploads to S3 with the reliability of rsync?
After asking the authors of the official aws cli (boto3) tool, I can conclude that aws cli always verifies every upload, including multipart ones.
It does this chunk by chunk, using the same official MD5 ETag verification as for single-part uploads. On top of that, you can also enable SHA256 verification, still chunk by chunk.
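To illustrate what per-chunk MD5 verification means at the API level, here is a minimal sketch of a manual multipart upload with boto3, where every part is sent with a Content-MD5 header so S3 rejects any corrupted part. The bucket and key are placeholders, and this is not what aws cli does internally, just the same mechanism spelled out by hand:

import base64
import hashlib

import boto3

def multipart_upload_with_md5(file_path, bucket, key, chunk_size=8 * 1024 * 1024):
    """Upload a file in parts, sending a Content-MD5 header with every part."""
    s3 = boto3.client('s3')
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []

    with open(file_path, 'rb') as fp:
        part_number = 1
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            # S3 checks this base64-encoded MD5 against the bytes it received
            # and fails the part upload if they don't match.
            content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode()
            response = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload['UploadId'],
                PartNumber=part_number, Body=data, ContentMD5=content_md5,
            )
            parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
            part_number += 1

    # A production version should abort the multipart upload on failure;
    # this sketch only shows the happy path.
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts},
    )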
aws cli does not, however, verify the whole assembled file. For that, you'd need a small Python function, for example:
import hashlib

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    """Compute the ETag S3 reports for a file uploaded with the given part size."""
    md5s = []

    # Hash the file part by part, the same way S3 hashes each uploaded part.
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    # Single-part upload: the ETag is just the MD5 of the whole file.
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    # Multipart upload: the ETag is the MD5 of the concatenated part digests,
    # followed by "-<number of parts>".
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))