I have a set of 4 GB files stored in S3 from which I need to extract 1 GB sections. I know that I can accomplish this via a boto3 S3 ranged GET request:
import boto3

s3 = boto3.client('s3')
bucket = ''
key = ''

# Byte offsets of the section to extract (values are illustrative)
start = 1_000_000_000
end = 2_000_000_000

response = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes={start}-{end}')
However, this download is slow because I am not taking advantage of S3's multipart download functionality. I understand how to perform multipart downloads using boto3's s3.Object.download_file() method, but I can't figure out how to specify an overall byte range for that method call.
When downloading large ranges of a file from S3, what is the fastest and cleanest way to perform multipart downloads? Assume that this is running on an EC2 instance in the same region as the S3 bucket.
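For context, the native multipart path I'm referring to looks roughly like the sketch below; download_file accepts a boto3.s3.transfer.TransferConfig to control part size and concurrency, but as far as I can tell none of its parameters restrict the download to a byte range (the bucket, key, and filename here are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart settings: parts are fetched in parallel internally, but nothing
# here lets me limit the download to a sub-range of the object.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,
)

obj = boto3.resource('s3').Object('example-bucket', 'example-key')
obj.download_file('example.bin', Config=config)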
I have come up with a working solution using a ThreadPoolExecutor, but I believe it can still be improved. The best approach I found was to set up a thread pool of s3_client.get_object calls, each with the Range parameter specified:
import math
from concurrent.futures import ThreadPoolExecutor

import boto3

KB = 1024
MB = KB * KB


def calculate_range_parameters(offset, length, chunk_size):
    """Build the list of Range headers covering offset through offset + length - 1."""
    num_parts = int(math.ceil(length / float(chunk_size)))
    range_params = []
    for part_index in range(num_parts):
        start_range = (part_index * chunk_size) + offset
        if part_index == num_parts - 1:
            # Last part: stop at the final byte of the requested region
            end_range = length + offset - 1
        else:
            end_range = start_range + chunk_size - 1
        range_params.append(f'bytes={start_range}-{end_range}')
    return range_params


def s3_ranged_get(args):
    """Fetch a single byte range and return its contents."""
    s3_client, bucket, key, range_header = args
    resp = s3_client.get_object(Bucket=bucket, Key=key, Range=range_header)
    body = resp['Body'].read()
    return body


def threaded_s3_get(s3_client, bucket, key, offset, length, chunksize=10 * MB):
    """Download `length` bytes starting at `offset` using parallel ranged GETs."""
    args_list = [(s3_client, bucket, key, x) for x in calculate_range_parameters(offset, length, chunksize)]

    # Dispatch work tasks with our client; map() returns results in input
    # order, so the chunks can be joined directly.
    with ThreadPoolExecutor(max_workers=20) as executor:
        results = executor.map(s3_ranged_get, args_list)

    content = b''.join(results)
    return content


s3 = boto3.client('s3')
bucket = ''
key = ''

content = threaded_s3_get(s3, bucket, key, 1 * MB, 101 * MB)
with open('data.bin', 'wb') as f:
    f.write(content)
calculate_range_parameters creates the list of Range arguments for a given file offset, length, and chunk size; s3_ranged_get wraps the boto3 S3 client's get_object method; and threaded_s3_get sets up the ThreadPoolExecutor. When accessing a 1.3 GB region of data in an open bucket from an in-region r5d.xlarge EC2 instance, this code downloads the data in 4.76 seconds. For comparison, using the boto3-native multipart download functionality to download the same amount of data under the same conditions takes 3.96 seconds (i.e. my solution takes about 1.2x the time of the native one).
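One refinement I may make later (a sketch only, not benchmarked): because executor.map yields results in input order, each chunk can be written straight to the output file as it completes instead of being concatenated in memory first. The threaded_s3_get_to_file helper below is a hypothetical variant that reuses calculate_range_parameters and s3_ranged_get from the code above:

def threaded_s3_get_to_file(s3_client, bucket, key, offset, length, filename, chunksize=10 * MB):
    """Download `length` bytes starting at `offset` directly into `filename`."""
    args_list = [(s3_client, bucket, key, x) for x in calculate_range_parameters(offset, length, chunksize)]
    with ThreadPoolExecutor(max_workers=20) as executor, open(filename, 'wb') as f:
        # map() yields chunks in request order, so they can be written
        # sequentially without holding the whole range in memory.
        for chunk in executor.map(s3_ranged_get, args_list):
            f.write(chunk)

threaded_s3_get_to_file(s3, bucket, key, 1 * MB, 101 * MB, 'data.bin')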
This solution will work for now, but long-term it would be great to see boto3 support multipart reads of large byte ranges natively.