Polars is much slower than DuckDB in conditional join + group_by/agg context

For the following example, where it involves a self conditional join and a subsequent groupby/aggregate operation. It turned out that in such case, DuckDB gives much better performance than Polars (~10x on a 32-core machine).

My questions are:

What could be the potential reason(s) for the slowness (relative to DuckDB) of Polars?
Am I missing some other faster ways of doing the same thing in Polars?

import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        value=rng.normal(0, 1, nrows),
    )
)

## polars
start = time.perf_counter()
res = (
    df.lazy()
    .join(df.lazy(), on=["id", "id2"], how="left")
    .filter(
        (pl.col("id3") > pl.col("id3_right"))
        & (pl.col("id3") - pl.col("id3_right") < 30)
    )
    .group_by(["id2", "id3", "id3_right"])
    .agg(pl.corr("value", "value_right"))
    .collect(streaming=True)
)
time.perf_counter() - start
# 120.93155245436355

## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT df.*, df2.id3 as id3_right, df2.value as value_right
        FROM df JOIN df as df2
        ON (df.id = df2.id
        AND df.id2 = df2.id2
        AND df.id3 > df2.id3
        AND df.id3 - df2.id3 < 30)
        """
    )
    .aggregate(
        "id2, id3, id3_right, corr(value, value_right) as value",
        "id2, id3, id3_right",
    )
    .pl()
)
time.perf_counter() - start
# 18.472263277042657

Solution

EDIT: 2023-7-18

The latest polars release has brought the difference down from 15x to 2x.

polars v0.18.2  1125
polars v0.18.3  140
duckdb 0.8.2-dev1   75

Original answer

Streaming engine

The streaming API isn't as optimized yet. Polars is a younger project than DuckDB and we haven't got as many paid developers on the project.

So give us time. Next release 0.18.3 will land a PR that can make a streaming groupby over 3.5x faster https://github.com/pola-rs/polars/pull/9346.

That just shows how much we still have on the table on the streaming engine. That same optimization we still have to do for streaming joins.

In short. Our streaming engine is in alpha stage. It is work in progress.

Different algorithm

Other that that, the duckdb query might also be using non-equi joins under the hood which we don't have yet in polars, so this query might not be as optimal for polars.