
How to list S3 objects efficiently?


I have implemented a pipeline in which I only want to list the objects modified or added after a specific time. I have written the code below, and it works, but it is not efficient. My source continuously produces CDC data, so the number of objects is large (999+).

As I understand it, the code below first lists all objects under the prefix and only then filters out the ones last modified at or after the specified date/time. Can we skip listing all the objects every time?

The code takes the same time for the timestamp 2010-07-25 00:00:00+00:00 as for 2023-07-25 00:00:00+00:00. Ideally the 2023 datetime should be faster: around 1M objects have been modified since the 2010 timestamp, but only 4 since the 2023 one.

import boto3

def get_objects(bucket, prefix):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)
    # JMESPath filter; this runs client-side on each page after it is fetched
    filtered_iterator = page_iterator.search(
        "Contents[?to_string(LastModified)>='\"2023-07-25 00:00:00+00:00\"'].Key"
    )
    count = 0
    for key in filtered_iterator:
        count += 1
        print(f"{count} --> {key}")
    print(f"count is {count}")

Essentially, I want to implement the backend logic of an AWS Glue job bookmark myself.


Solution

  • Listing objects in a large Amazon S3 bucket is time-consuming because each ListObjectsV2 API call returns at most 1,000 objects, and S3 has no server-side way to filter a listing by modification time. The LastModified filter in your code is applied client-side, after every object under the prefix has already been listed, which is why both timestamps take the same time.

    Some alternatives are:

    • Trigger an AWS Lambda function whenever objects are created. Write some code that either operates on the object immediately or writes the object information to a database so that it can be accessed quickly in the future. OR
    • Use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. This way, you don't need to make any API calls, but the list is only updated daily/weekly.
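The Lambda alternative can be sketched roughly as below. This is a minimal illustration, not a drop-in implementation: the DynamoDB table name `s3-object-index` and its key schema (partition key `bucket`, sort key `event_time`) are assumptions you would choose yourself when creating the table.

```python
def parse_s3_event(event):
    """Extract (bucket, key, event_time) tuples from an S3 put-notification event."""
    return [
        (
            rec["s3"]["bucket"]["name"],
            rec["s3"]["object"]["key"],
            rec["eventTime"],  # ISO 8601, e.g. "2023-07-25T12:34:56.000Z"
        )
        for rec in event.get("Records", [])
    ]

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    # Hypothetical table: partition key "bucket", sort key "event_time",
    # so "objects since timestamp T" becomes a single range Query.
    table = boto3.resource("dynamodb").Table("s3-object-index")
    for bucket, key, event_time in parse_s3_event(event):
        table.put_item(Item={"bucket": bucket, "key": key, "event_time": event_time})
```

With this in place, your pipeline queries the table for items with `event_time >= cutoff` instead of paginating through the whole bucket, so the cost scales with the number of matching objects rather than the bucket size.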
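The S3 Inventory alternative amounts to filtering the delivered report locally. A sketch, assuming the inventory was configured with CSV output and the fields Bucket, Key, LastModifiedDate in that order (the actual column set depends on your inventory configuration):

```python
import csv
from datetime import datetime

def keys_modified_since(inventory_csv_path, cutoff):
    """Return keys from an S3 Inventory CSV whose LastModifiedDate >= cutoff.

    cutoff must be a timezone-aware datetime; inventory timestamps look like
    "2023-07-25T00:00:00.000Z".
    """
    matched = []
    with open(inventory_csv_path, newline="") as f:
        for bucket, key, last_modified, *_ in csv.reader(f):
            # fromisoformat on Python < 3.11 does not accept a trailing "Z"
            ts = datetime.fromisoformat(last_modified.replace("Z", "+00:00"))
            if ts >= cutoff:
                matched.append(key)
    return matched
```

Scanning even a 1M-row CSV locally is far cheaper than 1,000+ ListObjectsV2 calls, at the cost of the report being at most daily/weekly fresh.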