I am hoping to count consecutive values in a column, preferably using Polars expressions.
import polars
df = pl.DataFrame(
{"value": [True,True,True,False,False,True,False,False,True,True]}
)
With the example data frame above, I would like to count the number of consecutive True values.
Below is an example using R's Data.Table package.
library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
Desired result displayed as a Polars DataFrame
shape: (10, 2)
┌───────┬──────┐
│ value ┆ len │
│ --- ┆ --- │
│ bool ┆ u32 │
╞═══════╪══════╡
│ true ┆ null │
│ true ┆ null │
│ true ┆ 3 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ 1 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ null │
│ true ┆ 2 │
└───────┴──────┘
Any ideas how this would be done efficiently using Polars?
Polars has an rle_id()
method similar to the R example.
We can use that as our "group id", and get the pl.len()
of each group with .over()
df.with_columns(
pl.len().over(pl.col("value").rle_id())
)
shape: (10, 2)
┌───────┬─────┐
│ value ┆ len │
│ --- ┆ --- │
│ bool ┆ u32 │
╞═══════╪═════╡
│ true ┆ 3 │
│ true ┆ 3 │
│ true ┆ 3 │
│ false ┆ 2 │
│ false ┆ 2 │
│ true ┆ 1 │
│ false ┆ 2 │
│ false ┆ 2 │
│ true ┆ 2 │
│ true ┆ 2 │
└───────┴─────┘
If you want to null out the other rows, one way could be .is_last_distinct()
inside a .when().then()
df.with_columns(
pl.when(pl.col("value") & pl.col("value").is_last_distinct())
.then(pl.len())
.over(pl.col("value").rle_id())
)
shape: (10, 2)
┌───────┬──────┐
│ value ┆ len │
│ --- ┆ --- │
│ bool ┆ u32 │
╞═══════╪══════╡
│ true ┆ null │
│ true ┆ null │
│ true ┆ 3 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ 1 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ null │
│ true ┆ 2 │
└───────┴──────┘