Is it possible to find all .json files within an S3 bucket, where the bucket itself can have multiple sub-directories?
My bucket contains multiple sub-directories, and I would like to collect all the JSON files inside them so I can iterate over them and parse specific key/values.
Here's a solution (it uses the boto3 module):
import boto3

s3 = boto3.client('s3')  # Create the connection to your bucket
objs = s3.list_objects_v2(Bucket='my-bucket')['Contents']
files = [obj for obj in objs if obj['Key'].endswith('.json')]  # keep JSON keys only
The syntax for the list_objects_v2 function in boto3 is documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2
Note that only the first 1000 keys are returned. To retrieve more than 1000 keys and fully exhaust the listing, you can use the Paginator class.
s3 = boto3.client('s3')  # Create the connection to your bucket
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='my-bucket')

files = []
for page in pages:
    # Keep only the JSON keys from each page of results
    page_files = [obj for obj in page['Contents'] if obj['Key'].endswith('.json')]
    files.extend(page_files)
Note: I recommend writing a function that uses yield to lazily iterate over the files instead of returning the whole list, especially if the number of JSON files in your bucket is extremely large.
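For example, a minimal generator sketch along those lines (the function name iter_json_objects is just illustrative, and 'my-bucket' again stands in for your bucket name):

import boto3

def iter_json_objects(bucket='my-bucket'):
    # Yield matching object summaries one page at a time instead of building a list
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.json'):
                yield obj

# Usage: parse each object as it is yielded
for obj in iter_json_objects():
    print(obj['Key'])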
Alternatively, you can also use the ContinuationToken parameter (check the boto3 reference linked above).
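For completeness, here is a rough sketch of manual pagination with ContinuationToken, again assuming the 'my-bucket' name used above:

import boto3

s3 = boto3.client('s3')
files = []
kwargs = {'Bucket': 'my-bucket'}
while True:
    resp = s3.list_objects_v2(**kwargs)
    files.extend(obj for obj in resp.get('Contents', []) if obj['Key'].endswith('.json'))
    if not resp.get('IsTruncated'):  # no more pages to fetch
        break
    # Pass the token from this response to request the next page
    kwargs['ContinuationToken'] = resp['NextContinuationToken']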