I'm trying to create a pandas dataframe with bucket object data (list_objects_v2) using boto3.
Without pagination, I can easily create a DataFrame by iterating over the response and normalizing the Contents entries into rows.
import boto3
import pandas
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket_name)  # the response is a nested dict
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
object_df = pandas.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pandas.json_normalize(response['Contents'])
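The loop over the response keys isn't strictly needed here, by the way; since the response is a dict, the 'Contents' key can be checked directly. A minimal sketch of that shortcut (the .get() default is just a guard for empty buckets, whose responses omit 'Contents' entirely):

# look the key up directly; an empty list yields an empty DataFrame
object_df = pandas.json_normalize(response.get('Contents', []))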
Because list_objects_v2 returns at most 1000 keys per call, I'm trying to get to the same result using pagination. I attempted this with the following code, but I don't get the desired output: each iteration overwrites object_df, so on larger buckets only the last page survives.
object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for elem in page:
        if 'Contents' in elem:
            # bug: this rebuilds object_df from the current page only,
            # discarding everything collected from earlier pages
            object_df = pandas.json_normalize(page['Contents'])
I managed to find a workaround by introducing a second DataFrame and appending each page to it.
appended_object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    object_df = pandas.json_normalize(page['Contents'])
    # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
    # so this only works on older pandas versions
    appended_object_df = appended_object_df.append(object_df, ignore_index=True)
I'm still curious whether it's possible to skip the appending step and have the code generate the complete DataFrame directly.
Per the pandas documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
So, you could do:
df_list = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    page_df = pandas.json_normalize(page['Contents'])
    df_list.append(page_df)
object_df = pandas.concat(df_list, ignore_index=True)
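And if you want to skip the per-page DataFrames entirely, you can accumulate the raw Contents dicts across pages and call json_normalize once at the end. A minimal sketch under the same assumptions (bucket_name and the s3 client defined as above; the .get() guard is my addition for pages or buckets without a 'Contents' key):

all_objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    # accumulate the raw dicts; .get() returns [] when a page has no 'Contents'
    all_objects.extend(page.get('Contents', []))
# one normalization pass builds the complete DataFrame directly
object_df = pandas.json_normalize(all_objects)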