
Find all JSON files within S3 Bucket


Is it possible to find all .json files within an S3 bucket, where the bucket itself can have multiple sub-directories?

My bucket includes multiple sub-directories, and I would like to collect all JSON files inside them so I can iterate over the files and parse specific key/values.


Solution

  • Here's the solution (it uses the boto3 module):

    import boto3
    
    s3 = boto3.client('s3')  # Create the S3 client
    objs = s3.list_objects_v2(Bucket='my-bucket')['Contents']
    
    # Keep only the .json keys; wrap filter() in list() since it
    # returns a lazy iterator in Python 3
    files = list(filter(lambda obj: obj['Key'].endswith('.json'), objs))
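    Since the goal is to parse specific key/values, here's a minimal sketch of reading one of the matched objects ('some_key' is just a placeholder for whatever your JSON actually contains):

    import json
    
    for obj in files:
        # Download each object's body and decode it as JSON
        body = s3.get_object(Bucket='my-bucket', Key=obj['Key'])['Body']
        data = json.loads(body.read())
        print(data.get('some_key'))  # hypothetical key; adjust to your schema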
    

    The syntax for the list_objects_v2 function in boto3 can be found here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2

    Note that list_objects_v2 returns at most 1,000 keys per call. To retrieve more than 1,000 keys and fully exhaust the listing, you can use the Paginator class:

    s3 = boto3.client('s3')  # Create the S3 client
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket='my-bucket')
    
    files = []
    for page in pages:
        # Filter each page's contents down to the .json keys;
        # page.get('Contents', []) guards against empty pages
        page_files = filter(lambda obj: obj['Key'].endswith('.json'),
                            page.get('Contents', []))
        files.extend(page_files)
    

    Note: I recommend writing a generator function that uses yield to lazily iterate over the files instead of building the whole list, especially if the number of JSON files in your bucket is extremely large; see the sketch below.
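    A minimal sketch of that generator approach (iter_json_keys is a hypothetical helper name; adjust the bucket and suffix to your setup):

    import boto3
    
    def iter_json_keys(bucket, suffix='.json'):
        """Lazily yield matching keys, one page at a time."""
        s3 = boto3.client('s3')
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get('Contents', []):
                if obj['Key'].endswith(suffix):
                    yield obj['Key']
    
    # Usage: memory stays flat no matter how many keys match
    for key in iter_json_keys('my-bucket'):
        ...  # fetch and parse each JSON file here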

    Alternatively, you can use the ContinuationToken parameter directly (check the boto3 reference linked above), as sketched below.
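
    A rough sketch of that manual approach (it hand-rolls what the Paginator does for you):

    import boto3
    
    s3 = boto3.client('s3')
    files = []
    kwargs = {'Bucket': 'my-bucket'}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        files.extend(obj['Key'] for obj in resp.get('Contents', [])
                     if obj['Key'].endswith('.json'))
        if not resp.get('IsTruncated'):  # no more pages to fetch
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']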