python, pandas, amazon-s3, bytesio

Writing Pandas dataframe.groupby results to S3 bucket


I have a large dataframe that I am trying to break into smaller pieces and write out as CSV files in S3. For testing purposes I have the group size set very low, but the concept is the same. Here is the code I have:

from io import BytesIO

import boto3

s3_client = boto3.client('s3')

if not submittingdata.empty:
    for i, g in submittingdata.groupby(df.index // 200):
        data = BytesIO()
        g.to_csv(data)
        s3_client.upload_fileobj(
            data,
            Bucket='some-magic-bucket',
            Key=f'file_prep_{i}.csv'
        )

The chunks are working correctly and the files are all being created as intended, but they are all empty. I'm not sure what I am missing. My understanding is that g.to_csv(data) should write the CSV body to the BytesIO object, which is then what I'm using to write the file. Maybe I'm misunderstanding that?


Solution

  • After following Patryk's suggestion above I was able to find a piece of code that works. By using a Resource rather than a client in boto3, and writing the contents of the BytesIO buffer as the body of a put, I was able to get files populated with values. The working code is:

    from io import BytesIO

    import boto3

    s3_resource = boto3.resource('s3')

    if not submittingdata.empty:
        for i, g in submittingdata.groupby(df.index // 200):
            data = BytesIO()
            g.to_csv(data)
            # getvalue() returns the buffer's entire contents regardless of
            # the current stream position, so the uploaded body is not empty
            s3_resource.Object(
                'some-magic-bucket',
                f'file_prep_{i}.csv'
            ).put(
                Body=data.getvalue()
            )
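
  • For what it's worth, the original upload_fileobj approach should also work once the buffer is rewound: upload_fileobj reads from the stream's current position, and after to_csv finishes that position is at the end of the buffer, which is why the uploaded files came out empty. Below is a minimal sketch of that variant (untested here, assuming the same s3_client, submittingdata, and df as in the question, and a pandas version recent enough to write CSV to a binary buffer):

    from io import BytesIO

    import boto3

    s3_client = boto3.client('s3')

    if not submittingdata.empty:
        for i, g in submittingdata.groupby(df.index // 200):
            data = BytesIO()
            g.to_csv(data)
            # Rewind so upload_fileobj reads from the start of the buffer
            # rather than from the end, where to_csv left the position.
            data.seek(0)
            s3_client.upload_fileobj(
                data,
                Bucket='some-magic-bucket',
                Key=f'file_prep_{i}.csv'
            )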