Tags: python, amazon-s3, python-multithreading, concurrent.futures

Uploading 2 million files (each approx. 30 KB) from an EC2 instance to S3 using a concurrent.futures ThreadPoolExecutor takes a lot of time


We have a requirement to upload about 2 million files (each approx. 30 KB) from an EC2 instance to S3. We are using Python, boto3, and the concurrent.futures module to try to achieve this. The following is the (simplified) code:

import boto3
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

class UploadToS3:

    def upload(self, file_path):
        try:
            # A new resource and bucket handle are created for every single upload
            s3 = boto3.resource('s3')
            bucket = s3.Bucket('xxxxxxxxxx')
            destination_file_path = 'yyyyy'
            bucket.upload_file(file_path, destination_file_path)
            del s3
        except Exception as e:
            print(e)

    def upload_files(self, file_paths):
        with ThreadPoolExecutor(max_workers=2000) as executor:
            tracker_futures = []
            for file_path in file_paths:
                tracker_futures.append(executor.submit(self.upload, file_path))
            # Drop futures as they finish so the list does not keep them alive
            for future in concurrent.futures.as_completed(tracker_futures):
                tracker_futures.remove(future)
                del future

However, we are finding that we can upload only ~78,000 files per hour. Increasing the number of threads does not have much effect; we believe this is because of the GIL. When we tried to use ProcessPoolExecutor, we ran into issues because the boto3 objects are not picklable. Any suggestions on how to overcome this?


Solution

  • Based on my general experience, that actually sounds pretty good - roughly 21 files per second.

    What might work better is to:

    • Zip (or otherwise smush together) the 2 million files into one giant archive file.
    • Upload that archive file to an EC2 instance in the same AWS Region as the S3 bucket.
    • Unzip the file on the EC2 instance.
    • Run the Python script on the EC2 instance.

    That will cut down on the roundtrip network time for each little S3 upload since everything will be inside AWS. However, you may still run into limits on the number of concurrent uploads and/or the number of uploads per second.
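
    As a rough sketch of the first step above (bundling the small files into one archive), assuming they all live under one local directory - the paths and archive name below are placeholders, not anything from the question:

    import tarfile
    from pathlib import Path

    def build_archive(source_dir, archive_path):
        """Bundle every file under source_dir into one gzip-compressed tar archive."""
        with tarfile.open(archive_path, "w:gz") as tar:
            for file_path in Path(source_dir).rglob("*"):
                if file_path.is_file():
                    # Store paths relative to source_dir so the archive unpacks cleanly
                    tar.add(file_path, arcname=file_path.relative_to(source_dir))

    build_archive("/data/small_files", "/data/small_files.tar.gz")

    The resulting archive can then be copied to the in-Region EC2 instance (e.g. with scp or rsync), unpacked there, and the upload script run from that instance.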

    In general - from DOS to Windows to Linux to S3, etc. - lots and lots of little files tend to take a lot longer to process/upload/etc. than the same amount of data in fewer, larger files.

    While S3 seems to do better than many other systems here, you may also want to consider, if you have not already done so, setting up S3 key prefixes (the equivalent of folders) so that the 2 million files do not all sit under a single prefix, i.e. in the equivalent of one directory. However, that may or may not be easy to do depending on the naming scheme of the files and how they will ultimately be used.
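
    If you do go the prefix route, one minimal sketch of spreading keys across prefixes is shown below - the hash-based scheme and the 'uploads/' prefix are purely illustrative assumptions, not something the question or this answer prescribes:

    import hashlib

    def prefixed_key(file_name, num_prefixes=16):
        """Derive a stable prefix from the file name so objects are spread
        across several S3 key prefixes instead of one flat 'directory'."""
        digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_prefixes
        # 'uploads/' is a placeholder top-level prefix
        return f"uploads/{index:02d}/{file_name}"

    print(prefixed_key("report_000123.json"))  # prints a key like uploads/<nn>/report_000123.json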