Search code examples
pythonpython-polars

Conditional aggregation of rows in polars


I have some data with 2 columns in the following format:

df = pl.from_repr("""
┌──────┬─────────┐
│ nums ┆ letters │
│ ---  ┆ ---     │
│ i64  ┆ str     │
╞══════╪═════════╡
│ 160  ┆ B       │
│ 540  ┆ I       │
│ 550  ┆ I       │
│ 590  ┆ B       │
│ 370  ┆ I       │
└──────┴─────────┘
""")

And I want to merge the cells that come in the form of B in addition to consecutive I's (i.e. BI*):

shape: (2, 1)
┌─────────────────┐
│ nums            │
│ ---             │
│ list[i64]       │
╞═════════════════╡
│ [160, 540, 550] │
│ [590, 370]      │
└─────────────────┘

How can I achieve something like this with polars?


Solution

  • You can search for the consecutive "B" -> "I" pairs, toggle those to true and cum_sum between them.

    df = pl.DataFrame({
        "nums": [160, 540, 550, 590, 370],
        "letters": ["B", "I", "I", "B", "I"]
    })
    
    (df.group_by(
        ((pl.col("letters") + pl.col("letters").shift(-1)).eq_missing("BI"))
        .cum_sum()
        .alias("cum_sum"),
        maintain_order=True,
    ).agg("nums"))
    
    

    This outputs:

    shape: (2, 2)
    ┌─────────┬─────────────────┐
    │ cum_sum ┆ nums            │
    │ ---     ┆ ---             │
    │ u32     ┆ list[i64]       │
    ╞═════════╪═════════════════╡
    │ 1       ┆ [160, 540, 550] │
    │ 2       ┆ [590, 370]      │
    └─────────┴─────────────────┘