Suppose I have urls, a list of Parquet URLs on S3.
I observe that this collect_all call runs in O(len(urls)) time.
Is there a better way to parallelize this task?
import polars as pl
pl.collect_all([pl.scan_parquet(url).filter(expr) for url in urls])
Depending on what your expr is and what you're doing next, you might be better off with:
pl.concat([
pl.scan_parquet(url) for url in urls
]).filter(expr).collect()
One difference is that instead of returning a list of separate DataFrames, this assumes you want them all combined into one and that they share the same schema.
Another approach is to use asyncio:
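import asyncio
# Top-level await works in a notebook or async REPL; in a script, put this inside an async function and run it with asyncio.run().
await asyncio.gather(*[pl.scan_parquet(url).filter(expr).collect_async() for url in urls])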
I've seen cases where the asyncio.gather approach is slightly faster than the alternative.