We have a requirement to upload about 2 million files (each approx. 30 KB) from an EC2 instance to S3. We are using Python, boto3, and the concurrent.futures module to try to achieve this. The following is the pseudocode:
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

import boto3

class UploadToS3:

    def upload(self, file_path):
        try:
            # boto3 resources are not thread-safe, so each call builds its own
            s3 = boto3.resource('s3')
            bucket = s3.Bucket('xxxxxxxxxx')
            destination_file_path = 'yyyyy'
            bucket.upload_file(file_path, destination_file_path)
        except Exception as e:
            print(e)

    def upload_files(self, file_paths):
        with ThreadPoolExecutor(max_workers=2000) as executor:
            tracker_futures = []
            for file_path in file_paths:
                tracker_futures.append(executor.submit(self.upload, file_path))
            for future in concurrent.futures.as_completed(tracker_futures):
                # drop the reference so completed futures can be garbage collected
                tracker_futures.remove(future)
However, we are finding that we can upload only ~78,000 files per hour. Increasing the number of threads does not have much effect; we believe it's because of the GIL. When we tried to use ProcessPoolExecutor, we ran into issues because the boto3 objects are not picklable. Any suggestions on how to overcome this scenario?
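For what it's worth, the usual way around the pickling problem is to never pass boto3 objects between processes at all: give each worker process its own client via an initializer, and submit only plain (path, key) tuples. A minimal sketch, where the bucket name and the (path, key) pairing are placeholder assumptions:

import concurrent.futures

import boto3

_s3 = None  # one client per worker process

def _init_worker():
    # Runs once in every worker process, so no boto3 object is ever pickled
    global _s3
    _s3 = boto3.client('s3')

def _upload(args):
    file_path, key = args
    _s3.upload_file(file_path, 'xxxxxxxxxx', key)

def upload_all(path_key_pairs):
    # path_key_pairs: iterable of (local_path, s3_key) tuples, which pickle fine
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=8, initializer=_init_worker) as executor:
        # chunksize batches the submissions to cut inter-process overhead
        for _ in executor.map(_upload, path_key_pairs, chunksize=100):
            pass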
Based on my general experience, that actually sounds pretty good: 78,000 files per hour works out to roughly 21 files per second.
What might work better is to:

- zip (or tar) the little files into a much smaller number of large archives,
- upload those archives to S3, and
- unpack them back into individual objects from within AWS, if you need them as separate files (see the sketch after this list).

That will cut down on the round-trip network time for each little S3 upload, since everything will be inside AWS. However, you may still run into limits on the number of concurrent uploads and/or the number of uploads per second.
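If the archive route fits, here is a rough sketch of the batching side; the batch size, archive naming, and bucket/key layout are illustrative assumptions, not from the original post:

import tarfile
from pathlib import Path

import boto3

def upload_in_batches(file_paths, batch_size=5000):
    # Pack a few thousand ~30 KB files into each tar.gz, then upload one
    # large object instead of thousands of small ones.
    s3 = boto3.client('s3')
    for i in range(0, len(file_paths), batch_size):
        archive = Path(f'batch_{i // batch_size:05d}.tar.gz')
        with tarfile.open(archive, 'w:gz') as tar:
            for fp in file_paths[i:i + batch_size]:
                tar.add(fp, arcname=Path(fp).name)
        s3.upload_file(str(archive), 'xxxxxxxxxx', f'archives/{archive.name}')
        archive.unlink()  # remove the local archive once it is uploaded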
In general - from DOS to Windows to Linux to S3, etc. - lots and lots of little files tend to take a lot longer to process/upload/etc. than the same amount of data in fewer, larger files.
While S3 seems to do better than many other systems, you may also want to consider, if you have not already done so, setting up S3 key prefixes ("folders") so that the 2 million files do not all sit in the equivalent of one directory. However, that may or may not be easy to do, depending on the naming scheme of the files and how they will ultimately be used.
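If distributing the keys makes sense for your naming scheme, one common pattern is to derive a short hash prefix from each file name so the objects fan out across many prefixes; the 'files/' root and the two-character fan-out below are arbitrary choices:

import hashlib

def prefixed_key(file_name):
    # First two hex chars of an MD5 of the name -> 256 possible prefixes,
    # so keys spread evenly instead of piling into one "directory"
    fan_out = hashlib.md5(file_name.encode()).hexdigest()[:2]
    return f'files/{fan_out}/{file_name}'

# e.g. prefixed_key('doc_000123.txt') -> 'files/<two hex chars>/doc_000123.txt'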