For the following toy example, which involves both window calculations and groupby aggregations, DuckDB runs nearly 3x slower than Polars in Python. Both give exactly the same results.

Is this kind of benchmark result expected because DuckDB is designed for, and should be used mainly for, cross-dataframe/table operations? Or does the inefficiency come from the way my SQL query is written?
import time
import duckdb
import numpy as np
import polars as pl
## example dataframe
rng = np.random.default_rng(1)
nrows = 10_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 100, nrows),
        id2=rng.integers(1, 1_000, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
    )
)
## polars
start = time.perf_counter()
res = (
    df.select(
        [
            "id",
            "id2",
            pl.col("v1") - pl.col("v1").mean().over(["id", "id2"]),
            pl.col("v2") - pl.col("v2").mean().over(["id", "id2"]),
            pl.col("v3") - pl.col("v3").mean().over(["id", "id2"]),
            pl.col("v4") - pl.col("v4").mean().over(["id", "id2"]),
        ]
    )
    .groupby(["id", "id2"])
    .agg(
        [
            (pl.col("v1") * pl.col("v2")).sum().alias("ans1"),
            (pl.col("v3") * pl.col("v4")).sum().alias("ans2"),
        ]
    )
)
time.perf_counter() - start
# 1.0977217499166727
## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT id, id2,
            v1 - mean(v1) OVER (PARTITION BY id, id2) AS v1,
            v2 - mean(v2) OVER (PARTITION BY id, id2) AS v2,
            v3 - mean(v3) OVER (PARTITION BY id, id2) AS v3,
            v4 - mean(v4) OVER (PARTITION BY id, id2) AS v4
        FROM df
        """
    )
    .aggregate(
        "id, id2, sum(v1 * v2) AS ans1, sum(v3 * v4) AS ans2",
        "id, id2",
    )
    .pl()
)
time.perf_counter() - start
# 3.549897135235369
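To confirm the two engines really do agree, the results can be compared after sorting on the group keys; a minimal sketch using polars' testing helper (default floating-point tolerances):

from polars.testing import assert_frame_equal

## both engines should produce the same ans1/ans2 per (id, id2) group
assert_frame_equal(
    res.sort(["id", "id2"]),
    res2.sort(["id", "id2"]),
)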
What DuckDB version are you running against? The window computations you are using were optimised for 0.8.x because they are constant across each partition.
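For intuition: a window aggregate that is constant across its partition, such as mean(v1) OVER (PARTITION BY id, id2), is logically equivalent to computing the grouped means once and joining them back. A hand-written sketch of that rewrite (an illustration only, not the plan DuckDB actually generates; on 0.8.x the optimiser covers this case for you):

print(duckdb.__version__)  # check which release you are on

## manual rewrite: grouped means joined back, then the same product sums
res3 = duckdb.sql(
    """
    SELECT id, id2,
        sum((v1 - m1) * (v2 - m2)) AS ans1,
        sum((v3 - m3) * (v4 - m4)) AS ans2
    FROM df
    JOIN (
        SELECT id, id2,
            mean(v1) AS m1, mean(v2) AS m2,
            mean(v3) AS m3, mean(v4) AS m4
        FROM df
        GROUP BY id, id2
    ) means USING (id, id2)
    GROUP BY id, id2
    """
).pl()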
I made a benchmark for a similar query (just different random numbers) and it ran in about the same time as the rewritten v2 query by @jqurious:
load
SELECT SETSEED(0.8675309);
CREATE TABLE df AS
SELECT
    (random() * 100)::INTEGER + 1 AS id,
    (random() * 1000)::INTEGER + 1 AS id2,
    random() AS v1,
    random() AS v2,
    random() AS v3,
    random() AS v4
FROM range(10000000);
run
SELECT id, id2, sum(v1 * v2) AS ans1, sum(v3 * v4) AS ans2
FROM (
    SELECT id, id2,
        v1 - mean(v1) OVER (PARTITION BY id, id2) AS v1,
        v2 - mean(v2) OVER (PARTITION BY id, id2) AS v2,
        v3 - mean(v3) OVER (PARTITION BY id, id2) AS v3,
        v4 - mean(v4) OVER (PARTITION BY id, id2) AS v4
    FROM df
)
GROUP BY ALL
Results:
name                                   run  timing
benchmark/micro/window/temp.benchmark  1    0.930582
benchmark/micro/window/temp.benchmark  2    0.914845
benchmark/micro/window/temp.benchmark  3    0.965103
benchmark/micro/window/temp.benchmark  4    0.951137
benchmark/micro/window/temp.benchmark  5    0.945187
With some pending vectorisation improvements, it drops another 10%:
name                                   run  timing
benchmark/micro/window/temp.benchmark  1    0.823040
benchmark/micro/window/temp.benchmark  2    0.826832
benchmark/micro/window/temp.benchmark  3    0.851842
benchmark/micro/window/temp.benchmark  4    0.797956
benchmark/micro/window/temp.benchmark  5    0.861512
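For comparison, the rewritten query can also be run from Python against the original df (same names as in the question; absolute timings will of course vary by machine and DuckDB version):

start = time.perf_counter()
res2 = duckdb.sql(
    """
    SELECT id, id2, sum(v1 * v2) AS ans1, sum(v3 * v4) AS ans2
    FROM (
        SELECT id, id2,
            v1 - mean(v1) OVER (PARTITION BY id, id2) AS v1,
            v2 - mean(v2) OVER (PARTITION BY id, id2) AS v2,
            v3 - mean(v3) OVER (PARTITION BY id, id2) AS v3,
            v4 - mean(v4) OVER (PARTITION BY id, id2) AS v4
        FROM df
    )
    GROUP BY ALL
    """
).pl()
time.perf_counter() - start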