
DuckDB slower than Polars in single table over + groupby context


In the following toy example, which involves both window calculations and groupby aggregations, DuckDB runs nearly 3x slower than Polars in Python. Both give exactly the same results.

Is this kind of benchmark result expected because DuckDB is designed for, and better suited to, cross-dataframe/table operations?

Or does the inefficiency just come from the way my SQL query is written?

import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 10_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 100, nrows),
        id2=rng.integers(1, 1_000, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
    )
)

## polars
start = time.perf_counter()
res = (
    df.select(
        [
            "id",
            "id2",
            pl.col("v1") - pl.col("v1").mean().over(["id", "id2"]),
            pl.col("v2") - pl.col("v2").mean().over(["id", "id2"]),
            pl.col("v3") - pl.col("v3").mean().over(["id", "id2"]),
            pl.col("v4") - pl.col("v4").mean().over(["id", "id2"]),
        ]
    )
    .groupby(["id", "id2"])
    .agg(
        [
            (pl.col("v1") * pl.col("v2")).sum().alias("ans1"),
            (pl.col("v3") * pl.col("v4")).sum().alias("ans2"),
        ]
    )
)
time.perf_counter() - start
# 1.0977217499166727

## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT id, id2,
        v1 - mean(v1) OVER (PARTITION BY id, id2) as v1,
        v2 - mean(v2) OVER (PARTITION BY id, id2) as v2,
        v3 - mean(v3) OVER (PARTITION BY id, id2) as v3,
        v4 - mean(v4) OVER (PARTITION BY id, id2) as v4,
        FROM df
        """
    )
    .aggregate(
        "id, id2, sum(v1 * v2) as ans1, sum(v3 * v4) as ans2",
        "id, id2",
    )
    .pl()
)
time.perf_counter() - start
# 3.549897135235369
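
To see where the time goes, DuckDB can profile the query itself. The sketch below (not part of the timings above) uses EXPLAIN ANALYZE, which executes the statement and prints per-operator timings, so the window step can be compared with the aggregation:

## profile the window pass (diagnostic only)
duckdb.sql(
    """
    EXPLAIN ANALYZE
    SELECT id, id2,
    v1 - mean(v1) OVER (PARTITION BY id, id2) as v1,
    v2 - mean(v2) OVER (PARTITION BY id, id2) as v2,
    v3 - mean(v3) OVER (PARTITION BY id, id2) as v3,
    v4 - mean(v4) OVER (PARTITION BY id, id2) as v4,
    FROM df
    """
).show()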

Solution

  • What DuckDB version are you running against? The window computations you are using were optimised in 0.8.x because they are constant across each partition.
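
    (A quick way to check, as an aside: the Python package reports its own version, and the engine can be asked directly.)

    import duckdb

    print(duckdb.__version__)                # version of the duckdb Python package
    duckdb.sql("SELECT version()").show()    # engine version string, e.g. v0.8.1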

    I made a benchmark for a similar query (just different random numbers) and it ran in about the same time as the rewritten v2 query by @jqurious:

    load
    SELECT SETSEED(0.8675309);
    CREATE TABLE df AS
        SELECT 
            (random() * 100)::INTEGER + 1 AS id,
            (random() * 1000)::INTEGER + 1 AS id2,
            random() AS v1,
            random() AS v2,
            random() AS v3,
            random() AS v4,
        FROM range(10000000);
    
    run
    SELECT id, id2, sum(v1 * v2) as ans1, sum(v3 * v4) as ans2
    FROM (
        SELECT id, id2,
        v1 - mean(v1) OVER (PARTITION BY id, id2) as v1,
        v2 - mean(v2) OVER (PARTITION BY id, id2) as v2,
        v3 - mean(v3) OVER (PARTITION BY id, id2) as v3,
        v4 - mean(v4) OVER (PARTITION BY id, id2) as v4,
        FROM df
    )
    GROUP BY ALL
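
    (GROUP BY ALL is DuckDB shorthand: it groups by every selected column that is not wrapped in an aggregate, here id and id2.)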
    

    Results:

    name    run timing
    benchmark/micro/window/temp.benchmark   1   0.930582
    benchmark/micro/window/temp.benchmark   2   0.914845
    benchmark/micro/window/temp.benchmark   3   0.965103
    benchmark/micro/window/temp.benchmark   4   0.951137
    benchmark/micro/window/temp.benchmark   5   0.945187
    

    With some pending vectorisation improvements, it drops another 10%:

    name    run timing
    benchmark/micro/window/temp.benchmark   1   0.823040
    benchmark/micro/window/temp.benchmark   2   0.826832
    benchmark/micro/window/temp.benchmark   3   0.851842
    benchmark/micro/window/temp.benchmark   4   0.797956
    benchmark/micro/window/temp.benchmark   5   0.861512
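
    As an aside (my sketch, not necessarily the @jqurious rewrite referenced above): the window pass can be avoided entirely by expanding the product algebraically. Within each (id, id2) group, sum((v1 - mean(v1)) * (v2 - mean(v2))) = sum(v1 * v2) - sum(v1) * sum(v2) / count(*), so the whole answer collapses into a single aggregation; results should match up to floating-point rounding:

    SELECT id, id2,
        sum(v1 * v2) - sum(v1) * sum(v2) / count(*) AS ans1,
        sum(v3 * v4) - sum(v3) * sum(v4) / count(*) AS ans2
    FROM df
    GROUP BY ALL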