I have implemented a pipeline in which I only want to list the objects modified or added after a specific time. The code below works, but it is not efficient. My source produces continuous CDC data, so the number of objects is very large (999+).
As I understand it, the code below first lists all objects under the prefix and only then filters them down to those last modified at or after the specified date/time. Can we skip listing all the objects every time?
The time taken by the code for the timestamp 2010-07-25 00:00:00+00:00 is the same as the time taken for 2023-07-25 00:00:00+00:00. Ideally, the 2023 datetime should take less time: roughly 1M objects match 2010-07-25 00:00:00+00:00, but only 4 objects match 2023-07-25 00:00:00+00:00.
import boto3

def get_objects(bucket, prefix):
    s3 = boto3.client("s3")
    s3_paginator = s3.get_paginator('list_objects_v2')
    s3_iterator = s3_paginator.paginate(Bucket=bucket, Prefix=prefix)
    # JMESPath filter, evaluated client-side on each page of results:
    # keep only the keys whose LastModified is at or after the cutoff
    filtered_iterator = s3_iterator.search(
        "Contents[?to_string(LastModified)>='\"2023-07-25 00:00:00+00:00\"'].Key"
    )
    c = 0
    for key_data in filtered_iterator:
        c = c + 1
        print(str(c) + ' --> ' + str(key_data))
    print("count is " + str(c))
Essentially, I want to implement the backend logic of an AWS Glue job bookmark myself.
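
For illustration, a minimal sketch of such bookmark-style logic, assuming the high-water mark is persisted as a small JSON state object in the same bucket (the _bookmark.json key and the helper names are hypothetical). Note that this avoids reprocessing old objects, but it still lists every key under the prefix:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
STATE_KEY = "_bookmark.json"  # hypothetical state object holding the high-water mark

def load_bookmark(bucket):
    # Read the last processed LastModified timestamp; default to the epoch
    try:
        body = s3.get_object(Bucket=bucket, Key=STATE_KEY)["Body"].read()
        return datetime.fromisoformat(json.loads(body)["last_modified"])
    except s3.exceptions.NoSuchKey:
        return datetime(1970, 1, 1, tzinfo=timezone.utc)

def save_bookmark(bucket, ts):
    s3.put_object(Bucket=bucket, Key=STATE_KEY,
                  Body=json.dumps({"last_modified": ts.isoformat()}))

def list_new_objects(bucket, prefix):
    bookmark = load_bookmark(bucket)
    newest = bookmark
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Compare timezone-aware datetimes directly instead of strings
            if obj["LastModified"] > bookmark and obj["Key"] != STATE_KEY:
                new_keys.append(obj["Key"])
                newest = max(newest, obj["LastModified"])
    save_bookmark(bucket, newest)
    return new_keys

A real implementation would keep the state somewhere more robust (e.g., DynamoDB) and handle concurrent runs.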
Listing objects in a large Amazon S3 bucket is inherently time-consuming: each ListObjectsV2 call returns at most 1,000 objects, and the LastModified filter in your JMESPath expression is only applied client-side after each page has already been fetched. That is why both timestamps take the same amount of time.
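
A rough sketch that makes the per-page cost visible (the bucket and prefix names are hypothetical):

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

cutoff = datetime(2023, 7, 25, tzinfo=timezone.utc)
pages = matches = 0
for page in paginator.paginate(Bucket="my-cdc-bucket", Prefix="cdc/"):
    pages += 1  # one ListObjectsV2 API call per page, regardless of the cutoff
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            matches += 1
print(f"{pages} API calls to find {matches} matching objects")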
Some alternatives are: