I have a Polars DataFrame with a bunch of columns (I'm working in Python). I need to group the dataframe by three different row criteria; every row will match exactly one of the three criteria. Minimal example data would look something like this:
df = pl.from_repr("""
┌─────┬───────┬────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1 ┆ 1 │
│ 3 ┆ 0 ┆ 1 │
└─────┴───────┴────────┘
""")
My goal is to identify three groups of rows: those where b and c are both zero, those where b and c are both non-zero, and everything that matches neither.
In the real use case the dataframes are large - millions of rows and ~30 columns and the row queries are more complex, but satisfy the same basic row uniqueness criteria.
I can find the rows with b and c = 0 using
all_zeros = df.filter(pl.all_horizontal(pl.col('b','c')==0))
and the non-zero rows using
no_zeros = df.filter(pl.all_horizontal(pl.col('b','c')!=0))
The third classification is whatever doesn't match these two queries.
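For completeness, a minimal sketch of how that remaining group could be selected directly, by negating both conditions (the name mixed is only illustrative):

# rows that match neither the all-zero nor the all-non-zero filter
mixed = df.filter(
    ~pl.all_horizontal(pl.col('b', 'c') == 0)
    & ~pl.all_horizontal(pl.col('b', 'c') != 0)
)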
What is the most efficient way of implementing these queries?
As written above, the second filter has to process the entire dataframe, even though the first filter has already identified rows that could be omitted from it. Is there some way I can use these filters in a group_by?
For usage in a .group_by you could use .when/.then to assign a "group id":
df.group_by(
    pl.when(pl.all_horizontal(pl.col('b', 'c') == 0))
      .then(1)
      .when(pl.all_horizontal(pl.col('b', 'c') != 0))
      .then(2)
).all()
shape: (3, 4)
┌─────────┬───────────┬───────────┬───────────┐
│ literal ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪═══════════╪═══════════╪═══════════╡
│ 2 ┆ [2] ┆ [1] ┆ [1] │
│ 1 ┆ [1] ┆ [0] ┆ [0] │
│ null ┆ [3] ┆ [0] ┆ [1] │
└─────────┴───────────┴───────────┴───────────┘
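As a small extension of the above (a sketch, assuming a recent Polars version; the column name 'group' and the id 3 are only illustrative), you could add .otherwise(3) so the third classification gets an explicit id instead of null, name the expression with .alias, and run the same query through the lazy engine for large frames:

group_id = (
    pl.when(pl.all_horizontal(pl.col('b', 'c') == 0)).then(1)
    .when(pl.all_horizontal(pl.col('b', 'c') != 0)).then(2)
    .otherwise(3)           # third classification: rows matching neither condition
    .alias('group')         # illustrative column name
)

# The expression is evaluated once per row, so the frame is scanned a single time.
out = df.group_by(group_id).agg(pl.col('a'))

# For millions of rows, the same query can also be run lazily:
out_lazy = df.lazy().group_by(group_id).agg(pl.col('a')).collect()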