For the following example, where it involves a self conditional join and a subsequent groupby/aggregate operation. It turned out that in such case, DuckDB
gives much better performance than Polars
(~10x on a 32-core machine).
My questions are:
DuckDB
) of Polars
?Polars
?import time
import duckdb
import numpy as np
import polars as pl
## example dataframe
rng = np.random.default_rng(1)
nrows = 5_000_000
df = pl.DataFrame(
dict(
id=rng.integers(1, 1_000, nrows),
id2=rng.integers(1, 10, nrows),
id3=rng.integers(1, 500, nrows),
value=rng.normal(0, 1, nrows),
)
)
## polars
start = time.perf_counter()
res = (
df.lazy()
.join(df.lazy(), on=["id", "id2"], how="left")
.filter(
(pl.col("id3") > pl.col("id3_right"))
& (pl.col("id3") - pl.col("id3_right") < 30)
)
.group_by(["id2", "id3", "id3_right"])
.agg(pl.corr("value", "value_right"))
.collect(streaming=True)
)
time.perf_counter() - start
# 120.93155245436355
## duckdb
start = time.perf_counter()
res2 = (
duckdb.sql(
"""
SELECT df.*, df2.id3 as id3_right, df2.value as value_right
FROM df JOIN df as df2
ON (df.id = df2.id
AND df.id2 = df2.id2
AND df.id3 > df2.id3
AND df.id3 - df2.id3 < 30)
"""
)
.aggregate(
"id2, id3, id3_right, corr(value, value_right) as value",
"id2, id3, id3_right",
)
.pl()
)
time.perf_counter() - start
# 18.472263277042657
The latest polars release has brought the difference down from 15x to 2x.
polars v0.18.2 1125
polars v0.18.3 140
duckdb 0.8.2-dev1 75
The streaming API isn't as optimized yet. Polars is a younger project than DuckDB and we haven't got as many paid developers on the project.
So give us time. Next release 0.18.3
will land a PR that can make a streaming groupby over 3.5x faster https://github.com/pola-rs/polars/pull/9346.
That just shows how much we still have on the table on the streaming engine. That same optimization we still have to do for streaming joins.
In short. Our streaming engine is in alpha stage. It is work in progress.
Other that that, the duckdb query might also be using non-equi joins under the hood which we don't have yet in polars, so this query might not be as optimal for polars.