Tags: amazon-web-services, amazon-s3, boto3, aws-cli, boto

Faster way to Copy S3 files


I am trying to copy around 50 million files, 15 TB in total, from one S3 bucket to another. The AWS CLI has options for copying quickly, but in my case I need to apply a filter and a date range, so I decided to write the code myself using boto3.

The source bucket input structure:

Folder1
    File1 - Date1
    File2 - Date1
Folder2
    File1 - Date2
    File2 - Date2
Folder3
    File1_Number1 - Date3
    File2_Number1 - Date3
Folder4
    File1_Number1 - Date2
    File2_Number1 - Date2
Folder5
    File1_Number2 - Date4
    File2_Number2 - Date4

So the goal is to copy, from each folder, all files whose names start with 'File1', filtered by a date range (Date2 to Date4). The dates (Date1, Date2, Date3, Date4) are the files' last-modified dates.

The output is partitioned by a date key, and I append a UUID to each file name to keep it unique so it never overwrites an existing file. Files with an identical date (the file's modified date) therefore end up in the same folder.

Target Bucket would have output:

Date2
    File1_UUID1
    File1_Number1_UUID2
Date3
    File1_Number1_UUID3
Date4
    File1_Number2_UUID4

I have written code using the boto3 API and AWS Glue to run it, but boto3 copies only about 500 thousand files per day.

The code:

import datetime
import uuid

import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# NOTE: boto_config, transfer_config and extra_args are not shown in the
# original question; these are minimal placeholder values
boto_config = Config(retries={'max_attempts': 10})
transfer_config = TransferConfig()
extra_args = {}

s3 = boto3.resource('s3', region_name='us-east-2', config=boto_config)

# source and target bucket names
src_bucket_name = 'staging1'
trg_bucket_name = 'staging2'

# source and target bucket pointers
s3_src_bucket = s3.Bucket(src_bucket_name)
print('Source Bucket Name : {0}'.format(s3_src_bucket.name))
s3_trg_bucket = s3.Bucket(trg_bucket_name)
print('Target Bucket Name : {0}'.format(s3_trg_bucket.name))

# target directory
trg_dir = 'api/requests'

# source objects (lazy, paginated listing of every object in the bucket)
s3_src_bucket_objs = s3_src_bucket.objects.all()

# request file name prefix
file_prefix = 'File1'

# filter - start and end date
start_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d").replace(tzinfo=None)
end_date = datetime.datetime.strptime("2020-06-15", "%Y-%m-%d").replace(tzinfo=None)

# iterate over every object in the source bucket, sequentially
for iterator_obj in s3_src_bucket_objs:
    file_path_key = iterator_obj.key
    date_key = iterator_obj.last_modified.replace(tzinfo=None)
    if start_date <= date_key <= end_date and file_prefix in file_path_key:
        # build a unique target file name: original name plus a UUID
        uni_uuid = uuid.uuid4()
        src_file_name = '{}_{}'.format(file_path_key.split('/')[-1], uni_uuid)

        # construct target directory path, partitioned by date key
        trg_dir_path = '{0}/datekey={1}'.format(trg_dir, date_key.date())

        # source file reference
        src_file_ref = {
            'Bucket': src_bucket_name,
            'Key': file_path_key
        }

        # target file path
        trg_file_path = '{0}/{1}'.format(trg_dir_path, src_file_name)

        # copy source file to target (one sequential CopyObject per file)
        trg_new_obj = s3_trg_bucket.Object(trg_file_path)

        trg_new_obj.copy(src_file_ref, ExtraArgs=extra_args, Config=transfer_config)

# happy ending

Is there any other way to make this faster, or an alternative way to copy files into such a target structure? Do you have any suggestions to improve the code? I am looking for a faster way to copy the files. Your input would be valuable. Thank you!


Solution

  • The most likely reason that you can only copy 500k objects per day (thus taking about 3-4 months to copy 50M objects, which is absolutely unreasonable) is because you're doing the operations sequentially.

    The vast majority of the time your code is running is spent waiting for each S3 Copy Object request to be sent to S3, processed by S3 (i.e., copying the object), and for the response to come back. On average, this takes around 170ms per object (500k/day ≈ one object every 170ms), which is reasonable for a single sequential request loop.

    To dramatically improve the performance of your copy operation, you should simply parallelize it: make many threads run the copies concurrently.
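
    For illustration, here is a minimal sketch of that parallelization using Python's concurrent.futures; the bucket names, the copy_one/copy_all helpers, and the worker count are assumptions for the example, not part of the original code:

    import concurrent.futures

    import boto3
    from botocore.config import Config

    # boto3 clients are documented as thread-safe (unlike resource objects), so a
    # single client can be shared by all workers; size the connection pool to match
    s3_client = boto3.client('s3', config=Config(max_pool_connections=100))

    def copy_one(src_bucket, src_key, trg_bucket, trg_key):
        # one managed copy per object (uses multipart automatically for large objects)
        s3_client.copy({'Bucket': src_bucket, 'Key': src_key}, trg_bucket, trg_key)

    def copy_all(work_items, max_workers=100):
        # work_items: iterable of (source_key, target_key) pairs built by your filter logic
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [
                pool.submit(copy_one, 'staging1', src_key, 'staging2', trg_key)
                for src_key, trg_key in work_items
            ]
            for future in concurrent.futures.as_completed(futures):
                future.result()  # surface copy errors instead of silently dropping them

    For tens of millions of objects you would likely feed the executor in batches (or from a queue, as discussed further down) rather than submitting everything at once.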

    Once the Copy commands are no longer the bottleneck (i.e., after you make them run concurrently), you'll hit another bottleneck: the List Objects requests. This listing runs sequentially (each page's request needs the continuation token from the previous response) and returns at most 1,000 keys per page, so with the straightforward, naive approach (list without any prefix or delimiter, wait for the response, then list again with the returned continuation token to get the next page) you'll end up sending around 50k List Objects requests one after another.
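
    For reference, this is roughly what that naive sequential listing looks like with boto3's paginator (a minimal sketch; the bucket name is illustrative):

    import boto3

    s3_client = boto3.client('s3')

    # each page returns at most 1,000 keys, and the next request needs the
    # continuation token from the previous response, so the pages are fetched
    # strictly one after another (~50k round trips for 50M objects)
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='staging1'):
        for obj in page.get('Contents', []):
            print(obj['Key'], obj['LastModified'])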

    Two possible solutions for the ListObjects bottleneck:

    • If you know the structure of your bucket fairly well (i.e., the "names of the folders", statistics on the distribution of "files" within those "folders", etc.), you could parallelize the ListObjects requests by making each thread list a given prefix (see the sketch after this list). Note that this is not a general solution: it requires intimate knowledge of the bucket's structure, and it usually only works well if that structure was originally planned to support this kind of operation.

    • Alternatively, you can ask S3 to generate an inventory of your bucket. You'll have to wait at most 1 day, but you'll end up with CSV files (or ORC, or Parquet) containing information about all the objects in your bucket.
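
    A sketch of the first approach, assuming the top-level "folder" names are known in advance; the prefixes and worker count here are illustrative:

    import concurrent.futures

    import boto3

    s3_client = boto3.client('s3')

    def list_prefix(bucket, prefix):
        # each prefix is still paginated sequentially, but distinct prefixes
        # can be listed by separate threads at the same time
        keys = []
        paginator = s3_client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                keys.append((obj['Key'], obj['LastModified']))
        return keys

    # the prefix list must come from your own knowledge of the bucket layout
    known_prefixes = ['Folder1/', 'Folder2/', 'Folder3/', 'Folder4/', 'Folder5/']

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        results = pool.map(lambda p: list_prefix('staging1', p), known_prefixes)
        all_objects = [item for keys in results for item in keys]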

    Either way, once you have the list of objects, have your code read the inventory (e.g., from local storage such as your local disk if you can download and store the files, or even by sending a series of ListObjects and GetObject requests to S3 to retrieve the inventory), decide which objects to copy and what their new keys should be (i.e., your logic), and then spin up a bunch of worker threads to run the S3 Copy Object operation on those objects.
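
    A sketch of that reading-and-filtering step, assuming the inventory CSV has already been downloaded to local disk; the file path, helper name, and column order are assumptions for the example (check them against your inventory configuration):

    import csv
    import datetime
    import uuid

    def build_work_items(inventory_csv_path, file_prefix='File1',
                         start=datetime.datetime(2019, 1, 1),
                         end=datetime.datetime(2020, 6, 15)):
        # Inventory CSVs have no header row and the column order depends on the
        # fields chosen when the inventory was configured; here we assume the
        # first four columns are bucket, key, size, last_modified.
        work_items = []
        with open(inventory_csv_path, newline='') as f:
            for bucket, key, size, last_modified, *rest in csv.reader(f):
                modified = datetime.datetime.fromisoformat(
                    last_modified.replace('Z', '+00:00')).replace(tzinfo=None)
                file_name = key.split('/')[-1]
                if start <= modified <= end and file_prefix in file_name:
                    trg_key = 'api/requests/datekey={}/{}_{}'.format(
                        modified.date(), file_name, uuid.uuid4())
                    work_items.append((key, trg_key))
        return work_items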

    In short:

    1. grab a list of all the objects first;

    2. then launch many workers to run the copies.

    One thing to watch out for here is if you launch an absurdly high number of workers and they all end up hitting the exact same partition of S3 for the copies. In such a scenario, you could end up getting some errors from S3. To reduce the likelihood of this happening, here are some things you can do:

    • instead of going sequentially over your list of objects, you could randomize it: e.g., load the inventory, put the items into a queue in random order, and have your workers consume from that queue. This decreases the likelihood of overheating a single S3 partition (see the sketch after this list).

    • keep your workers to not more than a few hundred (a single S3 partition should be able to easily keep up with many hundreds of requests per second).
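
    Putting those two points together, here is a sketch of the randomized-queue variant; the worker count, function name, and bucket arguments are illustrative, and error handling/retries are left out:

    import queue
    import random
    import threading

    import boto3
    from botocore.config import Config

    def run_copies(work_items, src_bucket, trg_bucket, num_workers=200):
        # size the client's connection pool to match the number of threads
        s3_client = boto3.client('s3', config=Config(max_pool_connections=num_workers))

        # randomize the order so consecutive copies are unlikely to keep
        # hitting the same S3 partition
        items = list(work_items)
        random.shuffle(items)

        work_queue = queue.Queue()
        for item in items:
            work_queue.put(item)

        def worker():
            while True:
                try:
                    src_key, trg_key = work_queue.get_nowait()
                except queue.Empty:
                    return
                s3_client.copy({'Bucket': src_bucket, 'Key': src_key},
                               trg_bucket, trg_key)

        # a few hundred threads at most, per the advice above
        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()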

    Final note: there's another thing to consider which is whether or not the bucket may be modified during your copy operation. If it could be modified, then you'll need a strategy to deal with objects that might not be copied because they weren't listed, or with objects that were copied by your code but got deleted from the source.