python, pandas, amazon-s3, memory-management, parquet

AWS Wrangler - Pandas read_sql to S3 in a limited-memory environment


I am looking for a way to extract data from a database and push it into a parquet dataset in S3, in an environment with limited memory. If I proceed like this:

import pandas as pd
import awswrangler as wr

with someDB.connect() as connect:
    df = pd.read_sql("SELECT * FROM table", connect)  # loads the entire result set into memory
    wr.s3.to_parquet(df, dataset=True, path="s3://flo-bucket/")

The Pandas data frame (df) is fully loaded into memory and then pushed to S3 by Wrangler, so if the data frame is too big the operation fails. I would like to chunk the data frame and pass the chunks to a process (it does not have to be Wrangler) that progressively sends them to S3 in parquet format. Is this possible? I found examples that use an IO buffer for a CSV file, but I don't think that is possible with parquet.


Solution

  • It is possible to read the data in small chunks by using the chunksize argument of pandas' read_sql function, and to append each chunk to the S3 dataset as it arrives (with dataset=True, mode="append" adds new parquet files under the prefix instead of overwriting).

    Try something like:

    import pandas as pd
    import awswrangler as wr
    
    path='s3://bucket/prefix'
    
    # Append one chunk to the dataset as new parquet file(s) under the prefix
    def write_parquet_chunk(chunk, path, index=False):
        wr.s3.to_parquet(
            df=chunk,
            path=path,
            dataset=True,
            mode="append",
            index=index
        )
    
    chunksize = 100  # rows per chunk; tune to available memory (very small chunks create many tiny parquet files)
    
    with someDB.connect() as connect:
        query = "SELECT * FROM table"
        chunks = pd.read_sql(query, connect, chunksize=chunksize)
    
        for i, chunk in enumerate(chunks):
            print(f"Processing chunk {i+1}")
            write_parquet_chunk(chunk, path)
    

    Ref: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
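
As a side note on the in-memory buffer idea from the question: parquet can be written to a buffer just like CSV, so the same chunked pattern also works without Wrangler. A rough sketch, assuming pyarrow (or fastparquet) and boto3 are available; the part-file key layout under the prefix is only a placeholder:

    import io
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    chunksize = 100

    with someDB.connect() as connect:
        chunks = pd.read_sql("SELECT * FROM table", connect, chunksize=chunksize)
        for i, chunk in enumerate(chunks):
            # serialize one chunk into an in-memory parquet buffer
            buffer = io.BytesIO()
            chunk.to_parquet(buffer, index=False)
            # upload the chunk as its own object under the dataset prefix
            s3.put_object(
                Bucket="flo-bucket",
                Key=f"dataset/part-{i:05d}.parquet",  # hypothetical key layout
                Body=buffer.getvalue(),
            )

To consume the dataset later without loading everything at once, awswrangler can also read it back chunk by chunk (this is what the read_parquet page linked above documents), for example:

    import awswrangler as wr

    # iterate over the dataset one DataFrame at a time
    for df_chunk in wr.s3.read_parquet("s3://bucket/prefix", dataset=True, chunked=True):
        print(len(df_chunk))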