I have an S3 bucket 'data' with a single directory '20230225' which contains JSON and video files. Within '20230225', I created a subdirectory 'metadata' into which I wanted to move the JSON files, so that the JSON and video files would end up in separate directories.
I wrote a function to copy the JSON files to another directory, which seemed to work on a small sample of the data. However, when I ran the function on all of the JSON files in '20230225', it took much longer than I expected. I interrupted the execution, and when I counted the files in the destination directory, there were many more JSON files than there should have been.
Here is the function code. Is there anything in there that could create extra files?
I'm thinking it could be because the listing of the source folder includes all of its subdirectories as well, and the source folder's only subdirectory is the destination folder, so maybe the function got stuck in a loop re-copying the files it had already copied into the destination folder.
However, even if that's the case, shouldn't it just overwrite those files rather than add extra ones?
import boto3
import botocore.config
from boto3.s3 import transfer as s3transfer

def copy_json_files(s3_bucket: str, source_folder: str, dest_folder: str):
    """
    Copy the .json objects at the top level of source_folder to dest_folder.

    Parameters:
    - s3_bucket (str): The name of the S3 bucket.
    - source_folder (str): The name of the source folder.
    - dest_folder (str): The name of the destination folder.

    Returns:
    - int: The number of files copied.
    """
    s3 = boto3.resource('s3')
    src_bucket = s3.Bucket(s3_bucket)
    # Create destination prefix
    dest_prefix = dest_folder.strip('/') + '/' if dest_folder else ''
    # Configure S3 transfer manager
    botocore_config = botocore.config.Config(max_pool_connections=200)
    s3client = boto3.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(use_threads=True, max_concurrency=140)
    # Create S3 transfer manager
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    copied_files = 0
    for obj in src_bucket.objects.filter(Prefix=source_folder):
        # Exclude objects in subdirectories of the source folder
        if '/' in obj.key[len(source_folder):]:
            continue
        # Exclude objects already in the destination folder
        if obj.key.startswith(dest_prefix):
            continue
        if obj.key.endswith('.json'):
            # Form the destination key by replacing the source folder name
            # with the destination folder name
            dest_key = obj.key.replace(source_folder, dest_prefix, 1)
            copy_source = {
                'Bucket': s3_bucket,
                'Key': obj.key
            }
            s3t.copy(
                copy_source=copy_source,
                bucket=s3_bucket,
                key=dest_key
            )
            copied_files += 1
    # Wait for all queued transfers to finish and close the transfer manager
    s3t.shutdown()
    return copied_files
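For illustration, a call for the layout described above would look something like the following (the exact prefix strings, including the trailing slashes, are assumptions):

# Illustrative invocation; note that the trailing '/' on source_folder
# matters: without it, obj.key[len(source_folder):] begins with '/' for
# every object, so every top-level file would be skipped.
n = copy_json_files('data', '20230225/', '20230225/metadata/')
print(f'Copied {n} files')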
The function I used to check the number of files was:
def count_files(s3_bucket, s3_dir):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(s3_bucket)
    count = 0
    # Counts every object whose key starts with s3_dir,
    # including objects inside subdirectories
    for obj in bucket.objects.filter(Prefix=s3_dir):
        count += 1
    return count
Objects in sub-folders are included in the object listing.
For example, if the source has one object and your code is run, it will copy that object to the sub-directory. The next time it is run, it will copy BOTH objects to the sub-folder, since src_bucket.objects.filter(Prefix=source_folder) includes all sub-folders in its listing.
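You can see this with a quick listing that passes only a Prefix; it returns everything beneath the prefix, at any depth (bucket and key names here are illustrative):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('data')

# Prefix matching is purely string-based; there is no notion of
# directory depth, so keys under '20230225/metadata/' are returned too
for obj in bucket.objects.filter(Prefix='20230225/'):
    print(obj.key)
# 20230225/video1.mp4
# 20230225/file1.json
# 20230225/metadata/file1.json   <- also listed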
If you only wish to copy objects in the 'top' of the source folder, then you will either need to skip any object whose Key contains a / after the source folder name, or extract the 'directory' portion of each Key (everything before the last /) and compare it to the name of the source folder.
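A minimal sketch of that second check, with illustrative names:

def in_top_of_source(key: str, source_folder: str) -> bool:
    # The 'directory' portion of a key is everything before the last '/'
    folder = key.rsplit('/', 1)[0] + '/'
    return folder == source_folder

print(in_top_of_source('20230225/file1.json', '20230225/'))           # True
print(in_top_of_source('20230225/metadata/file1.json', '20230225/'))  # False

Alternatively (a different approach from the string checks above, but a standard S3 feature), the ListObjectsV2 API accepts a Delimiter parameter that makes S3 itself stop at the first level, returning sub-folders separately as CommonPrefixes:

import boto3

s3client = boto3.client('s3')
paginator = s3client.get_paginator('list_objects_v2')

# With Delimiter='/', anything below the first level is grouped into
# CommonPrefixes, so Contents holds only the top-level objects
for page in paginator.paginate(Bucket='data', Prefix='20230225/', Delimiter='/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])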