
Find all JSON files within S3 Bucket


Is it possible to find all .json files within an S3 bucket, where the bucket itself can have multiple sub-directories?

My bucket includes multiple sub-directories, and I would like to collect all JSON files inside them so I can iterate over the files and parse specific key/values.


Solution

  • Here's the solution (it uses the boto3 module):

    import boto3
    
    s3 = boto3.client('s3')  # Create the S3 client
    objs = s3.list_objects_v2(Bucket='my-bucket')['Contents']
    
    # Keep only the .json keys; wrap filter() in list() since it
    # returns a lazy iterator in Python 3
    files = list(filter(lambda obj: obj['Key'].endswith('.json'), objs))
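    Since the goal is to parse specific key/values, here's a minimal sketch of reading one of the matched objects ('some_key' is just a placeholder for whatever your JSON actually contains):

    import json
    
    for obj in files:
        # Download each object's body and decode it as JSON
        body = s3.get_object(Bucket='my-bucket', Key=obj['Key'])['Body']
        data = json.loads(body.read())
        print(data.get('some_key'))  # hypothetical key; adjust to your schema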
    

    The syntax for the list_objects_v2 function in boto3 can be found here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2

    Note that list_objects_v2 returns at most 1,000 keys per call. To retrieve more than 1,000 keys and fully exhaust the listing, you can use the Paginator class:

    s3 = boto3.client('s3')  # Create the S3 client
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket='my-bucket')
    
    files = []
    for page in pages:
        # Filter each page's contents down to the .json keys;
        # page.get('Contents', []) guards against empty pages
        page_files = filter(lambda obj: obj['Key'].endswith('.json'),
                            page.get('Contents', []))
        files.extend(page_files)
    

    Note: I recommend writing a generator function that uses yield to lazily iterate over the files instead of building the whole list, especially if the number of JSON files in your bucket is extremely large; see the sketch below.
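    A minimal sketch of that generator approach (iter_json_keys is a hypothetical helper name; adjust the bucket and suffix to your setup):

    import boto3
    
    def iter_json_keys(bucket, suffix='.json'):
        """Lazily yield matching keys, one page at a time."""
        s3 = boto3.client('s3')
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get('Contents', []):
                if obj['Key'].endswith(suffix):
                    yield obj['Key']
    
    # Usage: memory stays flat no matter how many keys match
    for key in iter_json_keys('my-bucket'):
        ...  # fetch and parse each JSON file here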

    Alternatively, you can use the ContinuationToken parameter directly (check the boto3 reference linked above), as sketched below.
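
    A rough sketch of that manual approach (it hand-rolls what the Paginator does for you):

    import boto3
    
    s3 = boto3.client('s3')
    files = []
    kwargs = {'Bucket': 'my-bucket'}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        files.extend(obj['Key'] for obj in resp.get('Contents', [])
                     if obj['Key'].endswith('.json'))
        if not resp.get('IsTruncated'):  # no more pages to fetch
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']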