
Polars: AWS S3 connection pooling


I want to use Polars to read from Parquet files stored on S3. I'm running my code in AWS Lambda.

When using boto3, I would create a client in the global scope so that the connection is reused across invocations (i.e., a client is created once per Lambda cold start, not once per invocation):


import boto3

client = boto3.client("s3")

def handler(event, context):
    # Use the client here; the connection already exists

The Polars documentation says that Polars can connect to S3 for me by looking at the location of the file I'm reading:

df = pl.read_parquet("s3://path/to/file.parquet")

However, if I put this inside the handler, I assume the connection is re-created for each Lambda invocation. I really want to be able to pass a connection into read_parquet (or scan_parquet, or other IO methods) like:

df = pl.read_parquet("s3://path/to/file.parquet", connection_options={"aws": {"client": client}})

My reading of the docs is that the client config only controls how Polars connects; it doesn't accept a client object.

If I'm wrong and I can pass a client, what sort of object should it be? Am I wrong in assuming that a connection pool is useful here, or is the underlying API doing this for me, in some way?


Solution

  • I made an issue for this and the Polars maintainers made a change that improves the cache performance.

    However, as per my comment on the issue, using boto3 to download the files is still faster than using Polars to read from S3.

    I don't know exactly what part of the object_store / Polars stack is slower than boto3, but at the time of writing I'm using boto3 in my application.