I have some data with 2 columns in the following format:
df = pl.from_repr("""
┌──────┬─────────┐
│ nums ┆ letters │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════╪═════════╡
│ 160 ┆ B │
│ 540 ┆ I │
│ 550 ┆ I │
│ 590 ┆ B │
│ 370 ┆ I │
└──────┴─────────┘
""")
And I want to merge the cells that come in the form of B
in addition to consecutive I
's (i.e. BI*
):
shape: (2, 1)
┌─────────────────┐
│ nums │
│ --- │
│ list[i64] │
╞═════════════════╡
│ [160, 540, 550] │
│ [590, 370] │
└─────────────────┘
How can I achieve something like this with polars?
You can search for the consecutive "B" -> "I"
pairs, toggle those to true
and cum_sum
between them.
df = pl.DataFrame({
"nums": [160, 540, 550, 590, 370],
"letters": ["B", "I", "I", "B", "I"]
})
(df.group_by(
((pl.col("letters") + pl.col("letters").shift(-1)).eq_missing("BI"))
.cum_sum()
.alias("cum_sum"),
maintain_order=True,
).agg("nums"))
This outputs:
shape: (2, 2)
┌─────────┬─────────────────┐
│ cum_sum ┆ nums │
│ --- ┆ --- │
│ u32 ┆ list[i64] │
╞═════════╪═════════════════╡
│ 1 ┆ [160, 540, 550] │
│ 2 ┆ [590, 370] │
└─────────┴─────────────────┘