Tags: python, python-polars

How can I efficiently scan multiple remote parquet files in parallel?


Suppose I have urls, a list of URLs pointing to Parquet files on S3.

I observe that the collect_all call below takes time that grows linearly with the number of URLs, i.e. O(len(urls)).

Is there a better way to parallelize this task?

import polars as pl

# One lazy query per file; collect_all returns a list of DataFrames.
pl.collect_all([
    pl.scan_parquet(url).filter(expr) for url in urls
])

Solution

  • Depending on what your expr is and what you're doing next, you might be better off with

    pl.concat([
        pl.scan_parquet(url) for url in urls
    ]).filter(expr).collect()
    

    One difference is that, instead of returning a list of separate DataFrames, this version assumes you want them all combined into one and that the files share the same schema.

    Another approach is to use asyncio:

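    import asyncio
    
    # Top-level await works in a notebook; in a script, put this inside an async function and run it with asyncio.run().
    await asyncio.gather(*[pl.scan_parquet(url).filter(expr).collect_async() for url in urls])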
    

    I've seen cases where the asyncio.gather approach is slightly faster than the alternative.