I have been struggling with creating a feature, a counter that counts number of events prior to each event, where each prior event should have occurred within a given duration (dt). I know how to do it for all previous events, it is easy by using cum_sum
and over
of the given column. But, if I want to do this with only events within e.g last 2 days, how do I do that ??
Below is how I do it (the wrong way) with cum_count
.
import polars as pl
from datetime import date
df = pl.DataFrame(
data = {
"Event":["Rain","Sun","Rain","Sun","Rain","Sun","Rain","Sun"],
"Date":[
date(2022,1,1),
date(2022,1,2),
date(2022,1,2),
date(2022,1,3),
date(2022,1,3),
date(2022,1,5),
date(2022,1,5),
date(2022,1,8)
]
}
)
df.with_columns(
pl.col("Date").cum_count().over("Event").alias("cum_count")
)
outputting
shape: (8, 3)
┌───────┬────────────┬───────────┐
│ Event ┆ Date ┆ cum_count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═══════════╡
│ Rain ┆ 2022-01-01 ┆ 1 │
│ Sun ┆ 2022-01-02 ┆ 1 │
│ Rain ┆ 2022-01-02 ┆ 2 │
│ Sun ┆ 2022-01-03 ┆ 2 │
│ Rain ┆ 2022-01-03 ┆ 3 │
│ Sun ┆ 2022-01-05 ┆ 3 │
│ Rain ┆ 2022-01-05 ┆ 4 │
│ Sun ┆ 2022-01-08 ┆ 4 │
└───────┴────────────┴───────────┘
What I would like to output is this:
shape: (8, 3)
┌───────┬────────────┬───────────┐
│ Event ┆ Date ┆ cum_count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═══════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴───────────┘
(Preferably, a solution that scales somewhat well..)
Thanks
Tried this without success
You can try rolling
for this.
(
df
.rolling(
index_column="Date",
period="2d",
group_by="Event",
closed='both',
)
.agg(
pl.len() - 1
)
.sort("Date", "Event", reverse=[False, True])
)
shape: (8, 3)
┌───────┬────────────┬─────┐
│ Event ┆ Date ┆ len │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴─────┘
We subtract one in the agg
because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)