Convert a pandas DataFrame to Parquet format and upload it to an S3 bucket

I have a list of Parquet files that I need to copy from one S3 bucket to another S3 bucket in a different account. I have to add a few columns to the Parquet files before I upload them. I am trying to read each file into a pandas DataFrame, add the columns, and convert it back to Parquet, but it does not seem to work.

Here is what I am trying. my_parquet_list is where I get the list of all keys.

for file in my_parquet_list: 
    bucket = 'source_bucket_name'
    buffer = io.BytesIO()
    s3 = session.resource('s3')
    s3_obj = s3.Object(bucket,file)
    df = pd.read_parquet(buffer)
    df["col_new"] = 'xyz'
    df["date"] = datetime.datetime.utcnow()
    df.to_parquet(buffer, engine= 'pyarrow', index = False)
    bucketdest = 'dest_bucket_name'
    s3_file = 's3_folder_path/'+'.parquet'
    s3.Object(bucketdest, s3_file).put(Body=buffer.getvalue())


  • Just pip install s3fs and configure your AWS CLI credentials; after that you can write straight to S3 with df.to_parquet('s3://bucket_name/output-dir/df.parquet.gzip', index=False)