Find all JSON files within S3 Bucket

Is it possible to find all .json files within an S3 bucket when the bucket itself can have multiple sub-directories?

My bucket contains multiple sub-directories, and I would like to collect all the JSON files inside them so that I can iterate over them and parse specific keys/values.


  • Here's a solution (uses the boto3 module):

    import boto3

    def list_json_files(bucket='my-bucket'):
        s3 = boto3.client('s3')  # Create the S3 client
        # 'Contents' is missing from the response when no objects match
        objs = s3.list_objects_v2(Bucket=bucket).get('Contents', [])
        files = [obj for obj in objs if obj['Key'].endswith('.json')]  # json only
        return files

    The syntax for the list_objects_v2 function is described in the boto3 S3 client reference.

    Note that a single list_objects_v2 call returns at most 1000 keys. To retrieve every key in a larger bucket, you can use a paginator.

    import boto3

    def list_all_json_files(bucket='my-bucket'):
        s3 = boto3.client('s3')  # Create the S3 client
        paginator = s3.get_paginator('list_objects_v2')
        pages = paginator.paginate(Bucket=bucket)
        files = []
        for page in pages:
            # 'Contents' is missing from a page that holds no objects
            for obj in page.get('Contents', []):
                if obj['Key'].endswith('.json'):  # json only
                    files.append(obj)
        return files

    Note: I recommend using a generator function (one that uses yield) to iterate over the files lazily instead of returning the whole list, especially if the number of JSON files in your bucket is very large.
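
    A minimal sketch of that generator approach follows. The function name iter_json_keys is hypothetical, and the client is passed in as an argument (assumed to be a boto3 S3 client) so the function can be tried without AWS credentials. It yields only the key strings rather than the full object records.

```python
from typing import Any, Iterator


def iter_json_keys(s3_client: Any, bucket: str) -> Iterator[str]:
    """Lazily yield the key of every .json object in the bucket.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # A page that holds no objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                yield obj["Key"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for key in iter_json_keys(boto3.client("s3"), "my-bucket"):
#       print(key)
```

    Because pages are fetched only as the loop advances, at most one page of results is held in memory at a time.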

    Alternatively, you can page through the results manually with the ContinuationToken parameter of list_objects_v2 (see the boto3 reference).
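
    The ContinuationToken route could look roughly like this. It is a sketch under the same assumptions as before: list_json_keys_manual is a hypothetical name, and s3_client is assumed to be a boto3 S3 client. The loop relies on the IsTruncated and NextContinuationToken fields of the list_objects_v2 response.

```python
from typing import Any, Dict, List


def list_json_keys_manual(s3_client: Any, bucket: str) -> List[str]:
    """Collect all .json keys by following continuation tokens by hand."""
    keys: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": bucket}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for obj in response.get("Contents", []):
            if obj["Key"].endswith(".json"):
                keys.append(obj["Key"])
        # IsTruncated is true while more keys remain; the token from this
        # response is sent back as ContinuationToken on the next request.
        if not response.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
    return keys
```

    The paginator does the same bookkeeping internally, so the manual loop is mostly useful when you need to store a token and resume the listing later.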