Example setup
Warning: creating this DataFrame takes roughly 5 GB of memory.
```python
import time
import numpy as np
import polars as pl

rng = np.random.default_rng(1)
nrows = 50_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 50, nrows),
        id2=rng.integers(1, 500, nrows),
        v=rng.normal(0, 1, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
        v5=rng.normal(0, 1, nrows),
        v6=rng.normal(0, 1, nrows),
        v7=rng.normal(0, 1, nrows),
        v8=rng.normal(0, 1, nrows),
        v9=rng.normal(0, 1, nrows),
        v10=rng.normal(0, 1, nrows),
    )
)
```
I have a simple task at hand, as follows.
```python
start = time.perf_counter()
res = (
    df.lazy()
    .with_columns(
        pl.col(f"v{i}") - pl.col(f"v{i}").mean().over("id", "id2")
        for i in range(1, 11)
    )
    .group_by("id", "id2")
    .agg((pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11))
    .collect()
)
time.perf_counter() - start
# 9.85
```
The task above completes in ~10s on a 16-core machine. However, if I first split/partition the df by `id`, perform the same calculation as above per partition, and call `collect_all` and `concat` at the end, I can get a nearly 2x speedup.
```python
start = time.perf_counter()
res2 = pl.concat(
    pl.collect_all(
        dfi.lazy()
        .with_columns(
            pl.col(f"v{i}") - pl.col(f"v{i}").mean().over("id", "id2")
            for i in range(1, 11)
        )
        .group_by("id", "id2")
        .agg((pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11))
        for dfi in df.partition_by("id", maintain_order=False)
    )
)
time.perf_counter() - start
# 5.60
```
In addition, if I partition by `id2` instead of `id`, it is even faster, at ~4s.
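For reference, the only change in that variant is the key passed to `partition_by`; the per-partition query is identical (sketch below, `res3` is just an illustrative name):

```python
start = time.perf_counter()
res3 = pl.concat(
    pl.collect_all(
        dfi.lazy()
        .with_columns(
            pl.col(f"v{i}") - pl.col(f"v{i}").mean().over("id", "id2")
            for i in range(1, 11)
        )
        .group_by("id", "id2")
        .agg((pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11))
        # the only difference from res2: partition on "id2" instead of "id"
        for dfi in df.partition_by("id2", maintain_order=False)
    )
)
time.perf_counter() - start
# ~4 (as noted above)
```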
I also noticed that the second approach (whether partitioning by `id` or `id2`) has better CPU utilization than the first one. Maybe this is why the second approach is faster.
My question is: why is the second (partitioned) approach so much faster than the first?
Initial disclaimer: this is somewhat speculative, as I haven't looked through the source, but I think it's well founded.
I ran both tests in several modes:

| test | time (s) |
|---|---|
| test1 (as-is) | 20.44 |
| test2 (as-is) | 16.07 |
| test1 w/ pow in agg | 22.41 |
| test2 w/ pow in agg | 19.99 |
| test1 w/ pow in with_columns | 52.08 |
| test2 w/ pow in with_columns | 21.81 |
| test1 w/ pow in agg, eager | 47.90 |
| test2 w/ pow in agg, eager | 61.49 |
On the first run I got 20.44 for test1 and 16.07 for test2, so I'm replicating the same direction you are, though the gap on my machine was smaller.
I had htop running during both tests. Memory seemed to peak at roughly the same level in each, but the difference was that in test1 the cores weren't fully loaded, whereas in test2 all the cores were pretty consistently near the top.
To explore this further, I added `.pow(1.2).pow(0.3888)` in the `with_columns`, and did another round with it in the `agg`.
With the expensive operation in the `agg` (specifically, placed right after the `sum()`), test1 took 22.41 and test2 took 19.985.
I then took that out of the `agg` and put it in the `with_columns` (specifically on the raw `pl.col(f"v{i}")`, not on the mean). With the expensive operation there, the difference was really staggering: test1 was 52.08 and test2 was 21.81.
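For concreteness, here is roughly where the `.pow` calls went. This is a reconstruction of the variants described above rather than the exact code I ran, and `res_withpow` is just an illustrative name:

```python
# "pow in with_columns" variant: the expensive op is applied to the raw
# pl.col(f"v{i}") before the windowed mean is subtracted.
res_withpow = (
    df.lazy()
    .with_columns(
        pl.col(f"v{i}").pow(1.2).pow(0.3888)
        - pl.col(f"v{i}").mean().over("id", "id2")
        for i in range(1, 11)
    )
    .group_by("id", "id2")
    .agg(
        # the "pow in agg" variant instead appends .pow(1.2).pow(0.3888) here,
        # right after the .sum()
        (pl.col(f"v{i}") * pl.col("v")).sum()
        for i in range(1, 11)
    )
    .collect()
)
```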
During test1 I could see lulls where the cores were nearly idle, presumably while it was doing the much cheaper agg; during test2 they were pretty consistently maxed out. I also did both tests in eager mode and the results flipped: test1 was 47.90 while test2 was 61.49.
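By "eager mode" I mean running the same expressions directly on the DataFrames, with no `.lazy()`, `.collect()` or `collect_all`. A rough sketch of what eager test2 looks like (illustrative name, `.pow` calls omitted for brevity):

```python
# Eager version of test2 (sketch): without collect_all the partitions are
# evaluated sequentially, one DataFrame at a time, then concatenated.
res2_eager = pl.concat(
    dfi.with_columns(
        pl.col(f"v{i}") - pl.col(f"v{i}").mean().over("id", "id2")
        for i in range(1, 11)
    )
    .group_by("id", "id2")
    .agg((pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11))
    for dfi in df.partition_by("id", maintain_order=False)
)
```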
Based on those results, my guess is that in `collect_all` mode it hands one frame to each core, and that frame is worked on only by that core, as opposed to each expression getting a core. Because of that, it doesn't matter that the pre-group operations are much more expensive than the agg; it just keeps working hard the whole time.
In single-frame mode, it can't know in advance how expensive the operations will be, so it just splits the work according to its natural groups. As a result it takes the first group, does the pre-group operations, then takes that to the agg. During the agg it doesn't work as hard and, I think, that's why I'm seeing the cores go to near idle in waves. It's not until the agg is done that it starts on the next group and the cores ramp up in intensity again.
I'm sure my guess isn't exactly right but I think it's close.
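One rough way to test this guess, which I haven't done for the numbers above, would be to cap Polars' thread pool and see how each approach scales with the thread count. Note that `POLARS_MAX_THREADS` has to be set before polars is imported, i.e. in a fresh process:

```python
# Hypothetical experiment (not run for the timings above): cap the thread pool,
# then re-run test1 and test2 and compare how their runtimes scale.
import os

os.environ["POLARS_MAX_THREADS"] = "4"  # must be set before polars is imported

import polars as pl

print(pl.thread_pool_size())  # expected to report 4
```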