I'm trying to create a pandas dataframe with bucket object data (list_objects_v2) using boto3.
Without pagination, I can easily create a DataFrame by iterating over the response and normalizing the Contents entries into rows.
import boto3
import pandas
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket_name)  # the response is a nested dict
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
object_df = pandas.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pandas.json_normalize(response['Contents'])
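The loop over the response keys isn't strictly needed here, by the way; since the response is a dict, the 'Contents' key can be checked directly. A minimal sketch of that shortcut (the .get() default is just a guard for empty buckets, whose responses omit 'Contents' entirely):

# look the key up directly; an empty list yields an empty DataFrame
object_df = pandas.json_normalize(response.get('Contents', []))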
Because list_objects_v2 returns at most 1000 keys per call, I'm trying to get to the same result using pagination. I attempted this with the following code, but I don't get the desired output: each iteration overwrites object_df, so on larger buckets only the last page survives.
object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for elem in page:
        if 'Contents' in elem:
            # bug: this rebuilds object_df from the current page only,
            # discarding everything collected from earlier pages
            object_df = pandas.json_normalize(page['Contents'])
I managed to find a workaround by introducing a second DataFrame and appending each page to it.
appended_object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    object_df = pandas.json_normalize(page['Contents'])
    # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
    # so this only works on older pandas versions
    appended_object_df = appended_object_df.append(object_df, ignore_index=True)
I'm still curious whether it's possible to skip the appending step and have the code generate the complete DataFrame directly.
Per the pandas documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
So, you could do:
df_list = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    page_df = pandas.json_normalize(page['Contents'])
    df_list.append(page_df)
object_df = pandas.concat(df_list, ignore_index=True)
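And if you want to skip the per-page DataFrames entirely, you can accumulate the raw Contents dicts across pages and call json_normalize once at the end. A minimal sketch under the same assumptions (bucket_name and the s3 client defined as above; the .get() guard is my addition for pages or buckets without a 'Contents' key):

all_objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    # accumulate the raw dicts; .get() returns [] when a page has no 'Contents'
    all_objects.extend(page.get('Contents', []))
# one normalization pass builds the complete DataFrame directly
object_df = pandas.json_normalize(all_objects)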