python amazon-s3 boto3

Filtering S3 shared key (subfolder) for specific file type in boto3


Python/boto3 here. I have an S3 bucket with several "subfolders", and in one of these "subfolders" (yes I know there's no such thing in S3, but when you look at the layout below you'll understand) I have several dozen files, 0+ of which will be Excel (XLSX) files. Here's what my bucket looks like:

my_bucket/
    Fizz/
    Buzz/
    Foo/
        file1.jpg
        file2.jpg
        file3.txt
        file4.xlsx
        file5.pdf
        file6.xlsx
        file7.png
        ...etc.

So for, say, file4.xlsx, the bucket is my_bucket and the key is Foo/file4.xlsx (if I understand S3 properly). For file7.png, the bucket is still my_bucket and its key is Foo/file7.png, etc.

I need to look under this Foo/ "subfolder" for any file that ends with a .xlsx extension, and if one exists, do an S3 GetObject on that Excel file. It's fine if no Excel files exist, and it's fine if multiple exist. I just need to do a GetObject on the first one I find, if one is even there at all.

I understand that a typical boto3 invocation for getting an S3 object looks like:

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my_bucket", Key="Foo/file2.jpg")

But I'm not sure how to list all the my_bucket/Foo/* contents, filter by the first *.xlsx and do the get_object(...) on that specific file. Can anyone help nudge me in the right direction?


Solution

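  • As far as I know, S3 can't do this filtering for you.

    The list APIs (ListObjectsV2) only filter server-side by key prefix; there is no suffix or wildcard filter.

    But you can do it in two steps: list the keys under the prefix, then filter them by extension on the client side.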

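    import boto3

    s3_client = boto3.client('s3')
    bucket = 'my-bucket'
    prefix = 'my-prefix/foo/bar'  # e.g. 'Foo/' for the layout in the question

    # Step 1: list every object under the prefix, one page at a time.
    paginator = s3_client.get_paginator('list_objects_v2')
    response_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)

    file_names = []

    for response in response_iterator:
        # 'Contents' is missing from the response when nothing matches the prefix.
        for object_data in response.get('Contents', []):
            key = object_data['Key']
            if key.endswith('.xlsx'):
                file_names.append(key)

    # Step 2: fetch the first matching file, if there is one.
    if file_names:
        response = s3_client.get_object(Bucket=bucket, Key=file_names[0])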
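
Since only the first match is needed, the loop can also stop as soon as it sees one instead of collecting every key. Here is a minimal sketch of that variant. The helper name first_key_with_suffix is hypothetical (it is not part of boto3), and the client is passed in as a parameter so the function can be tried without AWS credentials. The get_paginator and paginate calls and the Contents/Key response fields are the same ones used above.

```python
def first_key_with_suffix(s3_client, bucket, prefix, suffix):
    """Return the first key under `prefix` that ends with `suffix`, or None.

    `s3_client` is assumed to be a boto3 S3 client (or anything exposing
    the same get_paginator("list_objects_v2") interface).
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix,
        # so fall back to an empty list instead of raising KeyError.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                # Returning here stops pagination; later pages are never requested.
                return obj["Key"]
    return None
```

With a real client this would be called as first_key_with_suffix(boto3.client("s3"), "my_bucket", "Foo/", ".xlsx"), and a non-None result passed as Key to get_object. The match is case-sensitive; comparing obj["Key"].lower() instead would also pick up names like FILE.XLSX.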