python, amazon-web-services, amazon-s3, amazon-ec2, boto3

Boto3 S3 Multipart Download of a Large Byte Range


I have a set of 4 GB files stored in S3 from which I need to extract 1 GB sections. I know that I can accomplish this via a boto3 S3 ranged GET request:

import boto3

s3 = boto3.client('s3')
bucket = ''
key = ''
start = 1_000_000_000
end = 2_000_000_000
response = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes={start}-{end}')

However, this download is slow because I am not taking advantage of S3's multipart download functionality. I understand how to perform multipart downloads using boto3's s3.Object.download_file() method, but I can't figure out how to specify an overall byte range for this method call.
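
For context, the boto3-native multipart path I'm referring to looks roughly like this (the filename and TransferConfig values below are only illustrative, not part of my actual setup); it parallelizes the transfer, but it always fetches the whole object and I don't see a way to pass an overall byte range:

import boto3
from boto3.s3.transfer import TransferConfig

s3_resource = boto3.resource('s3')
obj = s3_resource.Object(bucket, key)

# Illustrative settings: download_file splits the transfer into concurrent
# ranged GETs once the object exceeds multipart_threshold, but the full
# object is always downloaded.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024,
                        max_concurrency=10)
obj.download_file('whole_object.bin', Config=config)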

When downloading large ranges of a file from S3, what is the fastest and cleanest way to perform multipart downloads? Assume that this is running on an EC2 instance in the same region as the S3 bucket.


Solution

  • I came up with a working solution using a ThreadPoolExecutor, though I believe it can still be improved. The best approach I found is to set up a thread pool of s3_client.get_object calls, each with the Range parameter specified:

    import math
    from concurrent.futures import ThreadPoolExecutor
    
    import boto3
    
    KB = 1024
    MB = KB * KB
    
    
    def calculate_range_parameters(offset, length, chunk_size):
        # Split the region [offset, offset + length) into HTTP Range headers
        # of at most chunk_size bytes (Range end offsets are inclusive)
        num_parts = int(math.ceil(length / float(chunk_size)))
        range_params = []
        for part_index in range(num_parts):
            start_range = (part_index * chunk_size) + offset
            if part_index == num_parts - 1:
                # Final part stops at the end of the requested region
                end_range = length + offset - 1
            else:
                end_range = start_range + chunk_size - 1
    
            range_params.append(f'bytes={start_range}-{end_range}')
        return range_params
    
    
    def s3_ranged_get(args):
        # Fetch a single byte range and return its content as bytes
        s3_client, bucket, key, range_header = args
        resp = s3_client.get_object(Bucket=bucket, Key=key, Range=range_header)
        body = resp['Body'].read()
        return body
    
    
    def threaded_s3_get(s3_client, bucket, key, offset, length, chunksize=10 * MB):
        args_list = [(s3_client, bucket, key, x) for x in calculate_range_parameters(offset, length, chunksize)]
    
        # Dispatch work tasks with our client
        with ThreadPoolExecutor(max_workers=20) as executor:
            results = executor.map(s3_ranged_get, args_list)
    
        content = b''.join(results)
        return content
    
    
    s3 = boto3.client('s3')
    bucket = ''
    key = ''
    
    content = threaded_s3_get(s3, bucket, key, 1 * MB, 101 * MB)
    with open('data.bin', 'wb') as f:
        f.write(content)
    

    calculate_range_parameters builds the list of Range arguments for a given file offset, length, and chunk size; s3_ranged_get wraps the boto3 S3 client's get_object method; and threaded_s3_get sets up the ThreadPoolExecutor and stitches the parts back together.

    When accessing a 1.3 GB region of data in an open bucket from an in-region r5d.xlarge EC2 instance, this code downloads the data in 4.76 seconds. For comparison, using the boto3-native multipart download functionality to fetch the same amount of data under the same conditions takes 3.96 seconds (i.e. my solution takes about 1.2x the time of the native solution).
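
    To make the chunking concrete, here is what calculate_range_parameters produces for an illustrative 25 MB read starting 1 MB into the object, using the 10 MB chunk size (the numbers are just an example, not from the benchmark above):

    print(calculate_range_parameters(1 * MB, 25 * MB, 10 * MB))
    # ['bytes=1048576-11534335', 'bytes=11534336-22020095', 'bytes=22020096-27262975']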

    This solution will work for now, but long-term it would be great to see boto3 support multipart reads of large byte ranges natively.