python, python-3.x, amazon-s3, boto3

Fetch the latest 2 deepest subfolders from an S3 bucket using Python


I have an S3 bucket which has multiple integrations.

I want to read the files from the latest 2 subfolders at the deepest level.


I want to read all files from 2023/1/30/ and 2023/1/31/

import boto3

bucket_name = 'Bucket'
prefix = 'Facebook/Ad/'


s3_conn = boto3.client("s3")

response = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix)


objects = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)

for obj in objects[:2]:
    subfolder = obj['Key']
    print(f"Subfolder: {subfolder}")

But this gives me the latest 2 files from the last subfolder:

2023/1/31/file12
2023/1/31/file13

How can I read all files from the last 2 subfolders? Also, I do not want to hard-code things, as the number of subfolder levels might increase. I need to somehow find the latest 2 subfolders at the deepest level and fetch all files from them.


Solution

  • To properly get the list of common prefixes together with something like a "latest object" date, you will need to enumerate all of the objects. This can be done simply enough with list_objects_v2; however, each call to list_objects_v2 is limited to returning 1000 objects.

    Boto3 provides a helper called a Paginator that does the work of calling APIs like this multiple times with the proper parameters to work through all of the pages.

    Since your goal is to get a list of all objects that share two common prefixes, it makes sense to keep the complete list of objects as you encounter them; then you can operate on the list as desired after you've determined which two common prefixes are interesting.

    Putting it together, the code to do this would look something like this:

    import boto3
    from collections import defaultdict
    
    # Values taken from the question
    bucket_name = 'Bucket'
    prefix = 'Facebook/Ad/'
    
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    
    # Build a list of all common prefixes, with their objects
    common_prefixes = defaultdict(list)
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for cur in page.get('Contents', []):
            common_prefix = "/".join(cur['Key'].split("/")[:-1])
            # Store the information broken out by common prefix
            common_prefixes[common_prefix].append(cur)
    
    # Turn the dictionary into a simple list of common prefixes and objects
    common_prefixes = [(k, v) for k, v in common_prefixes.items()]
    # Sort by the max last modified date in each prefix
    common_prefixes.sort(key=lambda x: max(y['LastModified'] for y in x[1]))
    
    # And now show information on the two most recent common prefixes
    for common_prefix, objects in common_prefixes[-2:]:
        print(f"----- Prefix: {common_prefix} -----")
        # Just show the objects in a format somewhat like "aws s3 ls"
        for cur in objects:
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')} {cur['Size']:10d} {cur['Key']}")
    

    It should be noted that doing this will take time and memory for very large buckets, not to mention API calls. If you have a very large bucket with millions of objects, you should consider setting up an S3 Inventory report so you can pull down a single object to get a list of all objects. Or, if the prefixes follow a predictable pattern, use a few calls to list_objects_v2 with the Delimiter parameter, still via the Paginator, to directly find the prefixes and operate on them without querying for metadata on every object; a rough sketch of that approach follows.
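
    Here is a rough sketch of that Delimiter-based approach. It assumes the date-style layout shown in the question (for example Facebook/Ad/2023/1/31/), and the parent prefix below is a hypothetical example. Note that sorting prefixes by name is only chronological if the path components sort that way (e.g. zero-padded dates); otherwise you would need to compare them as numbers:

    import boto3
    
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    
    def child_prefixes(bucket, parent):
        # List the immediate "subfolders" under parent without enumerating every object
        found = []
        for page in paginator.paginate(Bucket=bucket, Prefix=parent, Delimiter='/'):
            for cp in page.get('CommonPrefixes', []):
                found.append(cp['Prefix'])
        return found
    
    # Hypothetical parent prefix one level above the per-day folders
    day_prefixes = child_prefixes('Bucket', 'Facebook/Ad/2023/1/')
    for day_prefix in sorted(day_prefixes)[-2:]:
        print(day_prefix)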