Tags: python, amazon-s3, parquet, python-polars

With Python, is there a way to write a Polars DataFrame directly into an S3 bucket as Parquet?


I'm looking for something like this:

Save Dataframe to csv directly to s3 Python

The API docs show these arguments: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html

But I'm not sure how to convert the DataFrame into a stream.


Solution

  • Untested, since I don't have an AWS account

    You could use s3fs.S3File like this:

    import polars as pl
    import s3fs
    
    fs = s3fs.S3FileSystem()  # picks up your default AWS credentials (writing requires credentials, so don't pass anon=True)
    df = pl.DataFrame(
        {
            "foo": [1, 2, 3, 4, 5],
            "bar": [6, 7, 8, 9, 10],
            "ham": ["a", "b", "c", "d", "e"],
        }
    )
    with fs.open('my-bucket/dataframe-dump.parquet', mode='wb') as f:
        df.write_parquet(f)
    

    Basically, s3fs gives you an fsspec-conformant file object, which Polars knows how to use because write_parquet accepts any file-like object or stream.

    If you want to manage your S3 connection more granularly, you can construct an S3File object from the botocore connection (see the docs linked above).