I'm trying to isolate runs of dates using polars. For this, I've been playing with .rolling
, .rle
, and .rle_id
but can't seem to fit them together to make it work.
Given this:
df = pl.DataFrame({
"date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06", "2024-01-07"]
})
df = df.with_columns(pl.col("date").str.to_date())
I would like to have a 2nd column with the id of the run if they're contiguous:
[1, 1, 1, 2, 2, 2]
You could take the diff
of the successive dates, identify those that are not 1 day and compute a cum_sum
:
(pl.DataFrame({'date': ['2024-01-01', '2024-01-02', '2024-01-03',
'2024-01-05', '2024-01-06', '2024-01-07']})
.with_columns(pl.col('date').str.to_date())
.with_columns(id=pl.col('date').diff().ne(pl.duration(days=1))
.fill_null(True).cum_sum())
)
Alternatively, generate a range of the step (with pl.duration
), subtract it to the date and use rle_id
:
(pl.DataFrame({'date': ['2024-01-01', '2024-01-02', '2024-01-03',
'2024-01-05', '2024-01-06', '2024-01-07']})
.with_columns(pl.col('date').str.to_date())
.with_columns(id=(pl.col('date')
-pl.duration(days=pl.arange(0, pl.col('date').len()))
).rle_id()+1)
)
Output:
┌────────────┬─────┐
│ date ┆ id │
│ --- ┆ --- │
│ date ┆ u32 │
╞════════════╪═════╡
│ 2024-01-01 ┆ 1 │
│ 2024-01-02 ┆ 1 │
│ 2024-01-03 ┆ 1 │
│ 2024-01-05 ┆ 2 │
│ 2024-01-06 ┆ 2 │
│ 2024-01-07 ┆ 2 │
└────────────┴─────┘