I have a dataframe with columns `x`, `y`, `c_1`, `c_2`, ..., `c_K`, where K is somewhat large (K ≈ 1000 or 2000). Each of the columns `c_i` is a boolean column, and I'd like to compute an aggregation `f(x, y)` over the rows where `c_i` is True. (For example, `f(x, y) = x.sum() * y.sum()`.)
One way to do this is:
```python
ds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")),
    )
    for i in range(K)
])
```
In my problem, the number K is large, and the above query seems somewhat inefficient: each filter `c_{i+1}` is evaluated twice, once for `x` and once for `y`.
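One thing worth checking first (a quick sketch, not part of the timings below): in lazy mode, Polars applies common-subexpression elimination, which may already deduplicate the repeated filter expressions; `LazyFrame.explain()` prints the optimized plan, so you can see whether the duplicate filters survive optimization.

```python
# Sketch: inspect the optimized plan of the lazy version of the query above.
# Whether the duplicated filters are actually merged depends on the Polars
# version and its common-subexpression-elimination pass.
lazy_query = ds.lazy().select(
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")),
    ).alias(f"f_{i+1}")
    for i in range(K)
)
print(lazy_query.explain())  # text representation of the optimized plan
```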
EDIT.
Here is a runnable example (code at bottom), as well as some timings corresponding to @Hericks's answer below. TLDR: Method 1 as proposed is the current best.
|    | Method | Wall time |
|---|---|---|
| 1 | repeated filters | 409ms |
| 2 | `pl.concat` | 29.6s (≈70x slower) |
| 2* | `pl.concat`, lazy | 1.27s (3x slower) |
| 3 | unpivot with agg | 1min 17s |
| 3* | unpivot with agg, lazy | 1min 17s (same as 3) |
```python
import polars as pl
import polars.selectors as cs
import numpy as np

rng = np.random.default_rng()

def f(x, y):
    return x.sum() * y.sum()

N = 2_000_000
K = 1000

# Two float columns plus K boolean columns c_1 ... c_K.
dat = dict()
dat["x"] = rng.standard_normal(N)
dat["y"] = rng.standard_normal(N)
for i in range(K):
    dat[f"c_{i+1}"] = rng.choice(2, N).astype(np.bool_)
tmpds = pl.DataFrame(dat)

## Method 1: repeated filters inside a single select
tmpds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")))
    .alias(f"f_{i+1}") for i in range(K)
])

## Method 2: one filtered select per c_i column, concatenated horizontally
pl.concat([
    tmpds.filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}"))
    for i in range(K)
], how="horizontal")

## Method 2*: same as Method 2, but each per-column query is built lazily and collected
pl.concat([
    tmpds.lazy().filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}")).collect()
    for i in range(K)
], how="horizontal")

## Method 3: unpivot the c_i columns, then group by column name and aggregate
(
    tmpds
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable")
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
)

## Method 3*: lazy version of Method 3
(
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
    .collect()
)
```
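The timings in the table are wall time; outside of IPython's `%time`, they can be reproduced with a small helper along these lines (the `bench` function and its application to Method 1 are illustrative, not how the numbers above were originally measured):

```python
import time

def bench(label, fn):
    # crude wall-clock timer; good enough for the ~100x differences above
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

bench("Method 1", lambda: tmpds.select(
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")),
    ).alias(f"f_{i+1}")
    for i in range(K)
))
```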
In general, I would not expect multiple filters to be inefficient in Polars. To verify this, I benchmarked three different approaches:
This first approach (repeated filters) was proposed in the question. A Python generator expression creates one aggregation expression per `c_i` column.

```python
df.select(
    (
        pl.col("x").filter(f"c_{i+1}").sum()
        * pl.col("y").filter(f"c_{i+1}").sum()
    ).alias(f"c_{i+1}")
    for i in range(K)
)
```
This second approach was proposed by @roman. Again, a Python generator expression is used, this time to create one single-row dataframe per `c_i` column; these are then concatenated horizontally.

```python
pl.concat(
    [
        (
            df
            .filter(pl.col(f"c_{i+1}"))
            .select((pl.col("x").sum() * pl.col("y").sum()).alias(f"c_{i+1}"))
        )
        for i in range(K)
    ],
    how="horizontal"
)
```
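As the question's Method 2* shows, running each per-column query lazily and collecting it narrows the gap considerably, presumably because projection pushdown means only `x`, `y`, and the single `c_i` column are filtered instead of all K + 2 columns:

```python
# Lazy variant of the second approach (Method 2* in the question)
pl.concat(
    [
        df.lazy()
        .filter(pl.col(f"c_{i+1}"))
        .select((pl.col("x").sum() * pl.col("y").sum()).alias(f"c_{i+1}"))
        .collect()
        for i in range(K)
    ],
    how="horizontal"
)
```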
I thought this third approach (unpivot with agg) was nice as it doesn't rely on standard Python generator expressions and doesn't use f-strings to create column names.

```python
import polars.selectors as cs

(
    df
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(
        pl.col("x").sum() * pl.col("y").sum()
    )
)
```
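For the benchmark, `df` was built analogously to the example in the question; a sketch of the assumed setup (same random generation scheme, just with n = 5000 rows):

```python
import numpy as np
import polars as pl

rng = np.random.default_rng()
n, K = 5_000, 1_000

# n rows: two float columns plus K boolean columns c_1 ... c_K,
# mirroring the generation scheme from the question
df = pl.DataFrame(
    {"x": rng.standard_normal(n), "y": rng.standard_normal(n)}
    | {f"c_{i+1}": rng.choice(2, n).astype(np.bool_) for i in range(K)}
)
```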
The results for `f(x, y) = x.sum() * y.sum()` on a dataframe with n = 5000 rows and K = 1000 `c_i` columns are as follows. They confirm the initial impression.
|   | Method | Runtime |
|---|---|---|
| 1 | repeated filters | 16.7ms |
| 2 | `pl.concat` | 4150ms |
| 3 | unpivot with agg | 40.9ms |