Search code examples
pythonpython-polars

How best to apply successive filters to a Polars Dataframe?


I have a Polars Dataframe with a bunch of columns (I'm working in Python) I need to group the dataframe by three different row criteria - every row will match exactly one of the three criteria. Minimal example data would look something like this

df = pl.from_repr("""
┌─────┬───────┬────────┐
│ a   ┆   b   ┆   c    │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ i64   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆   0   ┆   0    │
│ 2   ┆   1   ┆   1    │
│ 3   ┆   0   ┆   1    │
└─────┴───────┴────────┘
""")

My goal is to identify,

  1. rows where b and c are both zero
  2. rows where b and c are both non-zero
  3. rows where b and c are mixed zero/non-zero

In the real use case the dataframes are large - millions of rows and ~30 columns and the row queries are more complex, but satisfy the same basic row uniqueness criteria.

I can find the rows with b and c= 0 using

all_zeros = df.filter(pl.all_horizontal(pl.col('b','c')==0))

and the the non-zero rows using

no_zeros = df.filter(pl.all_horizontal(pl.col('b','c')!=0))

The third classification is whatever doesn't match these two queries.

What is the most efficient way of implementing these queries?

As written above, the second filter has to process the entire dataframe, even though the first filter has identified rows that could be omitted from the second filter? Is there some way I can uses these filters in a group_by?


Solution

  • For usage in a .group_by you could use .when/.then to assign a "group id":

    df.group_by(
       pl.when(pl.all_horizontal(pl.col('b', 'c') == 0))
         .then(1)
         .when(pl.all_horizontal(pl.col('b', 'c') != 0))
         .then(2)
    ).all()
    
    shape: (3, 4)
    ┌─────────┬───────────┬───────────┬───────────┐
    │ literal ┆ a         ┆ b         ┆ c         │
    │ ---     ┆ ---       ┆ ---       ┆ ---       │
    │ i32     ┆ list[i64] ┆ list[i64] ┆ list[i64] │
    ╞═════════╪═══════════╪═══════════╪═══════════╡
    │ 2       ┆ [2]       ┆ [1]       ┆ [1]       │
    │ 1       ┆ [1]       ┆ [0]       ┆ [0]       │
    │ null    ┆ [3]       ┆ [0]       ┆ [1]       │
    └─────────┴───────────┴───────────┴───────────┘