How can I efficiently scan multiple remote parquet files in parallel?

Suppose I have urls, a list of Parquet file URLs on S3.

I observe that the runtime of the collect_all call below grows linearly with the number of URLs, i.e. O(len(urls)).

Is there a better way to parallelize this task?

import polars as pl

# Build one lazy query per file, then collect them all in a single call
pl.collect_all([
    pl.scan_parquet(url).filter(expr) for url in urls
])

Solution

Depending on what your expr is and what you're doing next, you might be better off with

pl.concat([
    pl.scan_parquet(url) for url in urls
]).filter(expr).collect()

One difference is that instead of returning a list of separate DataFrames, this version combines everything into a single DataFrame, so it assumes that is what you want and that all the files share the same schema.

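Another approach is to use asyncio with collect_async. Note that the bare await here only works where an event loop is already running, such as in a notebook or inside an async function: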

import asyncio

await asyncio.gather(*[pl.scan_parquet(url).filter(expr).collect_async() for url in urls])

I've seen cases where the asyncio.gather approach is slightly faster than the alternative.