Tags: python, python-polars

Polars dataframe: how to efficiently aggregate over many non-disjoint groups


I have a dataframe with columns x, y, c_1, c_2, ..., c_K, where K is somewhat large (K ≈ 1000 or 2000).

Each column c_i is a boolean column, and I'd like to compute an aggregation f(x, y) over the rows where c_i is True. (For example, f(x, y) = x.sum() * y.sum().)

One way to do this is:

ds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")),
    ).alias(f"f_{i+1}")
    for i in range(K)
])

In my problem K is large, and the above query seems somewhat inefficient: each boolean column c_i is used as a filter twice, once for x and once for y.

  • What is the recommended/most efficient/most elegant way of accomplishing this?
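
One variant worth sketching: running the same query through the lazy engine. Polars' lazy optimizer performs common-subexpression elimination, which may evaluate each mask pl.col(f"c_{i+1}") once per pair rather than twice (a sketch using ds, f, and K from above; not benchmarked):

result = (
    ds.lazy()
    .select(
        f(
            pl.col("x").filter(pl.col(f"c_{i+1}")),
            pl.col("y").filter(pl.col(f"c_{i+1}")),
        ).alias(f"f_{i+1}")
        for i in range(K)
    )
    .collect()
)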

EDIT.

Here is a runnable example (code at the bottom), together with timings for the methods from @Hericks's answer below. TL;DR: Method 1, as proposed, is currently the best.

Method                      Wall time
1   repeated filters        409 ms
2   pl.concat               29.6 s     (≈70x slower)
2*  pl.concat, lazy         1.27 s     (3x slower)
3   unpivot with agg        1 min 17 s
3*  unpivot with agg, lazy  1 min 17 s (same as 3)

import polars as pl
import polars.selectors as cs
import numpy as np
rng = np.random.default_rng()

def f(x, y):
    return x.sum() * y.sum()

N = 2_000_000
K = 1000
dat = dict()
dat["x"] = np.random.randn(N)
dat["y"] = np.random.randn(N)
for i in range(K):
    dat[f"c_{i+1}"] = rng.choice(2, N).astype(np.bool_)

tmpds = pl.DataFrame(dat)


## Method 1
tmpds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")))
    .alias(f"f_{i+1}") for i in range(K)
])

## Method 2
pl.concat([
    tmpds.filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}"))
    for i in range(K)
], how="horizontal")

## Method 2*
pl.concat([
    tmpds.lazy().filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}")).collect()
    for i in range(K)
], how="horizontal")

## Method 3
(
    tmpds
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable")
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
)

## Method 3*
(
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
    .collect()
)
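
Note that Methods 3/3* return a long frame with one row per c_i, rather than the single wide row produced by Methods 1, 2, and 2*. A minimal sketch of converting the long result back to the wide shape (assuming the aggregate is given .alias("f"); the transpose uses the "variable" values as the new column names):

long = (
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(f(pl.col("x"), pl.col("y")).alias("f"))
    .collect()
)
wide = long.select("f").transpose(column_names=long["variable"])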

Solution

  • In general, I would not expect repeated filters to be inefficient in polars. To verify this, I benchmarked three different approaches:

    1. Generator of expressions with 2 filters

    This approach was proposed in the question.

    df.select(
        (
            pl.col("x").filter(f"c_{i+1}").sum()
            * pl.col("y").filter(f"c_{i+1}").sum()
        ).alias(f"c_{i}")
        for i in range(K)
    )
    

    2. Concatenating dataframes, each created using a single filter

    This approach was proposed by @roman. Again, a Python generator is used to create the dataframes.

    pl.concat(
        [
            (
                df
                .filter(pl.col(f"c_{i+1}"))
                .select((pl.col("x").sum() * pl.col("y").sum()).alias(f"c_{i+1}"))
            )
            for i in range(K)
        ],
        how="horizontal"
    )
    

    3. Unpivot + single filter + group by / agg

    I thought this approach was nice, as it relies neither on standard Python generator expressions nor on f-strings to create column names.

    import polars.selectors as cs
    
    (
        df
        .unpivot(on=cs.starts_with("c"), index=["x", "y"])
        .filter("value")
        .group_by("variable", maintain_order=True)
        .agg(
            pl.col("x").sum() * pl.col("y").sum()
        )
    )
    

    The results for f(x, y) = x.sum() * y.sum() on a dataframe with n = 5000 rows and K = 1000 c_i columns are as follows. They confirm the initial impression.

    Method                Runtime
    1   repeated filters  16.7 ms
    2   pl.concat         4150 ms
    3   unpivot with agg  40.9 ms
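
    For reference, a minimal plain-Python timing harness along these lines (a sketch; df, K as above, and bench is a hypothetical helper, not the code used for the table above):

    import time

    def bench(label, fn, repeat=3):
        # run fn several times and report the best wall time
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        print(f"{label}: {best * 1000:.1f} ms")

    bench("1 repeated filters", lambda: df.select(
        (
            pl.col("x").filter(f"c_{i+1}").sum()
            * pl.col("y").filter(f"c_{i+1}").sum()
        ).alias(f"c_{i+1}")
        for i in range(K)
    ))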