Tags: python · amazon-web-services · amazon-s3 · boto3

Boto3 S3: Get files without getting folders


Using boto3, how can I retrieve all files in my S3 bucket without retrieving the folders?

Consider the following file structure:

file_1.txt
folder_1/
    file_2.txt
    file_3.txt
    folder_2/
        folder_3/
            file_4.txt

In this example I'm only interested in the 4 files.

EDIT:

A manual solution is:

def count_files_in_folder(prefix):
    # Count only real files: skip the zero-byte "folder" placeholder
    # keys, which end with a trailing slash.
    total = 0
    keys = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    for key in keys['Contents']:
        if not key['Key'].endswith('/'):
            total += 1
    return total

In this case total would be 4.
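The trailing-slash test can be checked without touching S3 at all; here is a minimal sketch using the key names from this question (`count_file_keys` is an invented helper, not a boto3 call):

```python
def count_file_keys(keys):
    # A key ending in '/' is a zero-byte "folder" placeholder, not a file.
    return sum(1 for key in keys if not key.endswith('/'))

keys = [
    "file_1.txt",
    "folder_1/",
    "folder_1/file_2.txt",
    "folder_1/file_3.txt",
    "folder_1/folder_2/",
    "folder_1/folder_2/folder_3/",
    "folder_1/folder_2/folder_3/file_4.txt",
]
print(count_file_keys(keys))  # → 4
```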

If I just did

count = len(s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)['Contents'])

the result would be 7 objects (4 files and 3 folder placeholders):

file_1.txt
folder_1/
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/
folder_1/folder_2/folder_3/
folder_1/folder_2/folder_3/file_4.txt

I JUST want:

file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt

Solution

  • S3 is an OBJECT STORE. It does not store objects in a directory tree. Newcomers are often confused by the "folder" option the console offers, which is in fact just an arbitrary prefix on the object key.

    An object PREFIX is a way to retrieve your objects organised by a predefined, fixed key-name prefix structure.

    You can imagine a file system that doesn't let you create directories, but does let you create file names containing a slash "/" or backslash "\" as a delimiter; you can then denote the "level" of a file by a common prefix.

    Thus in S3, you can use any of the following key names to "simulate a directory" that is not really a directory:

    folder1-folder2-folder3-myobject
    folder1/folder2/folder3/myobject
    folder1\folder2\folder3\myobject
    

    As you can see, any of these object names can be stored in S3, regardless of which arbitrary folder separator (delimiter) you use.
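    To see how a delimiter turns flat key names into apparent "levels", here is a hedged, pure-Python sketch of the grouping S3 performs when you pass Delimiter='/' to a list call (`common_prefixes` is an invented helper, not a boto3 function):

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    # Keys that contain the delimiter after the prefix are rolled up
    # into a single "common prefix" (what the console shows as a folder);
    # the rest are returned as plain object keys.
    prefixes, contents = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            contents.append(key)
    return sorted(prefixes), contents

keys = ["file_1.txt", "folder_1/file_2.txt",
        "folder_1/folder_2/folder_3/file_4.txt"]
print(common_prefixes(keys))
# → (['folder_1/'], ['file_1.txt'])
print(common_prefixes(keys, prefix="folder_1/"))
# → (['folder_1/folder_2/'], ['folder_1/file_2.txt'])
```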

    However, to help users make bulk file transfers to S3, tools such as the AWS CLI and the s3transfer API simplify the steps and create object names that follow your local folder structure.

    So if you are sure that all your S3 objects use / or \ as the separator, you can use tools like s3transfer or the AWS CLI to download them simply by key name.

    Here is quick-and-dirty code using the resource iterator. Bucket.objects.filter returns an iterator that paginates transparently, so it doesn't have the 1000-key-per-call limit of list_objects()/list_objects_v2().

    import os
    import boto3

    s3 = boto3.resource('s3')
    mybucket = s3.Bucket("mybucket")
    # If a blank prefix is given, everything in the bucket is returned.
    bucket_prefix = "some/prefix/here"
    objs = mybucket.objects.filter(Prefix=bucket_prefix)

    for obj in objs:
        # Skip zero-byte "folder" placeholder keys.
        if obj.key.endswith('/'):
            continue
        path, filename = os.path.split(obj.key)
        # download_file raises if the local target directory doesn't exist.
        if path:
            os.makedirs(path, exist_ok=True)
        mybucket.download_file(obj.key, obj.key)