Tags: python, python-polars

Polars dataframe: how to efficiently aggregate over many non-disjoint groups


I have a dataframe with columns x, y, c_1, c_2, ..., c_K, where K is somewhat large (K ≈ 1000 or 2000).

Each column c_i is a boolean column, and I'd like to compute an aggregation f(x, y) over the rows where c_i is True. (For example, f(x, y) = x.sum() * y.sum().)

One way to do this is:

ds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")),
    ).alias(f"f_{i+1}")
    for i in range(K)
])

In my problem K is large, and the above query seems somewhat inefficient: each boolean column c_i is used as a filter twice, once for x and once for y.

  • What is the recommended/most efficient/most elegant way of accomplishing this?
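
One variant worth sketching: running the same query through the lazy engine. Polars' lazy optimizer performs common-subexpression elimination, which may evaluate each mask pl.col(f"c_{i+1}") once per pair rather than twice (a sketch using ds, f, and K from above; not benchmarked):

result = (
    ds.lazy()
    .select(
        f(
            pl.col("x").filter(pl.col(f"c_{i+1}")),
            pl.col("y").filter(pl.col(f"c_{i+1}")),
        ).alias(f"f_{i+1}")
        for i in range(K)
    )
    .collect()
)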

EDIT.

Here is a runnable example (code at the bottom), together with timings for the methods from @Hericks's answer below. TL;DR: Method 1, as proposed, is currently the best.

Method                      Wall time
1   repeated filters        409 ms
2   pl.concat               29.6 s     (≈70x slower)
2*  pl.concat, lazy         1.27 s     (3x slower)
3   unpivot with agg        1 min 17 s
3*  unpivot with agg, lazy  1 min 17 s (same as 3)

import polars as pl
import polars.selectors as cs
import numpy as np
rng = np.random.default_rng()

def f(x, y):
    return x.sum() * y.sum()

N = 2_000_000
K = 1000
dat = dict()
dat["x"] = np.random.randn(N)
dat["y"] = np.random.randn(N)
for i in range(K):
    dat[f"c_{i+1}"] = rng.choice(2, N).astype(np.bool_)

tmpds = pl.DataFrame(dat)


## Method 1
tmpds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")))
    .alias(f"f_{i+1}") for i in range(K)
])

## Method 2
pl.concat([
    tmpds.filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}"))
    for i in range(K)
], how="horizontal")

## Method 2*
pl.concat([
    tmpds.lazy().filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}")).collect()
    for i in range(K)
], how="horizontal")

## Method 3
(
    tmpds
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable")
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
)

## Method 3*
(
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
    .collect()
)
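
Note that Methods 3/3* return a long frame with one row per c_i, rather than the single wide row produced by Methods 1, 2, and 2*. A minimal sketch of converting the long result back to the wide shape (assuming the aggregate is given .alias("f"); the transpose uses the "variable" values as the new column names):

long = (
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(f(pl.col("x"), pl.col("y")).alias("f"))
    .collect()
)
wide = long.select("f").transpose(column_names=long["variable"])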

Solution

  • In general, I would not expect repeated filters to be inefficient in polars. To verify this, I benchmarked three different approaches:

    1. Generator of expressions with 2 filters

    This approach was proposed in the question.

    df.select(
        (
            pl.col("x").filter(f"c_{i+1}").sum()
            * pl.col("y").filter(f"c_{i+1}").sum()
        ).alias(f"c_{i}")
        for i in range(K)
    )
    

    2. Concatenating dataframes, each created using a single filter

    This approach was proposed by @roman. Again, a Python generator is used to create the dataframes.

    pl.concat(
        [
            (
                df
                .filter(pl.col(f"c_{i+1}"))
                .select((pl.col("x").sum() * pl.col("y").sum()).alias(f"c_{i+1}"))
            )
            for i in range(K)
        ],
        how="horizontal"
    )
    

    3. Unpivot + single filter + group by / agg

    I thought this approach was nice, as it relies neither on standard Python generator expressions nor on f-strings to create column names.

    import polars.selectors as cs
    
    (
        df
        .unpivot(on=cs.starts_with("c"), index=["x", "y"])
        .filter("value")
        .group_by("variable", maintain_order=True)
        .agg(
            pl.col("x").sum() * pl.col("y").sum()
        )
    )
    

    The results for f(x, y) = x.sum() * y.sum() on a dataframe with n = 5000 rows and K = 1000 c_i columns are as follows. They confirm the initial impression.

    Method                Runtime
    1   repeated filters  16.7 ms
    2   pl.concat         4150 ms
    3   unpivot with agg  40.9 ms
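
    For reference, a minimal plain-Python timing harness along these lines (a sketch; df, K as above, and bench is a hypothetical helper, not the code used for the table above):

    import time

    def bench(label, fn, repeat=3):
        # run fn several times and report the best wall time
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        print(f"{label}: {best * 1000:.1f} ms")

    bench("1 repeated filters", lambda: df.select(
        (
            pl.col("x").filter(f"c_{i+1}").sum()
            * pl.col("y").filter(f"c_{i+1}").sum()
        ).alias(f"c_{i+1}")
        for i in range(K)
    ))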