Search code examples
python-polars

Polars get count of events prior to "this" event, but within given duration


I have been struggling with creating a feature, a counter that counts number of events prior to each event, where each prior event should have occurred within a given duration (dt). I know how to do it for all previous events, it is easy by using cum_sum and over of the given column. But, if I want to do this with only events within e.g last 2 days, how do I do that ??

Below is how I do it (the wrong way) with cum_count.

import polars as pl
from datetime import date

df = pl.DataFrame(
    data = {
            "Event":["Rain","Sun","Rain","Sun","Rain","Sun","Rain","Sun"],
            "Date":[
                date(2022,1,1),
                date(2022,1,2),
                date(2022,1,2),
                date(2022,1,3),
                date(2022,1,3),
                date(2022,1,5),
                date(2022,1,5),
                date(2022,1,8)
            ]
    }
)

df.with_columns(
    pl.col("Date").cum_count().over("Event").alias("cum_count")
)

outputting

shape: (8, 3)
┌───────┬────────────┬───────────┐
│ Event ┆ Date       ┆ cum_count │
│ ---   ┆ ---        ┆ ---       │
│ str   ┆ date       ┆ u32       │
╞═══════╪════════════╪═══════════╡
│ Rain  ┆ 2022-01-01 ┆ 1         │
│ Sun   ┆ 2022-01-02 ┆ 1         │
│ Rain  ┆ 2022-01-02 ┆ 2         │
│ Sun   ┆ 2022-01-03 ┆ 2         │
│ Rain  ┆ 2022-01-03 ┆ 3         │
│ Sun   ┆ 2022-01-05 ┆ 3         │
│ Rain  ┆ 2022-01-05 ┆ 4         │
│ Sun   ┆ 2022-01-08 ┆ 4         │
└───────┴────────────┴───────────┘

What I would like to output is this:

shape: (8, 3)
┌───────┬────────────┬───────────┐
│ Event ┆ Date       ┆ cum_count │
│ ---   ┆ ---        ┆ ---       │
│ str   ┆ date       ┆ u32       │
╞═══════╪════════════╪═══════════╡
│ Rain  ┆ 2022-01-01 ┆ 0         │
│ Sun   ┆ 2022-01-02 ┆ 0         │
│ Rain  ┆ 2022-01-02 ┆ 1         │
│ Sun   ┆ 2022-01-03 ┆ 1         │
│ Rain  ┆ 2022-01-03 ┆ 2         │
│ Sun   ┆ 2022-01-05 ┆ 1         │
│ Rain  ┆ 2022-01-05 ┆ 1         │
│ Sun   ┆ 2022-01-08 ┆ 0         │
└───────┴────────────┴───────────┘

(Preferably, a solution that scales somewhat well..)

Thanks

Tried this without success


Solution

  • You can try rolling for this.

    (
        df
        .rolling(
            index_column="Date",
            period="2d",
            group_by="Event",
            closed='both',
        )
        .agg(
            pl.len() - 1
        )
        .sort("Date", "Event", reverse=[False, True])
    )
    
    shape: (8, 3)
    ┌───────┬────────────┬─────┐
    │ Event ┆ Date       ┆ len │
    │ ---   ┆ ---        ┆ --- │
    │ str   ┆ date       ┆ u32 │
    ╞═══════╪════════════╪═════╡
    │ Rain  ┆ 2022-01-01 ┆ 0   │
    │ Sun   ┆ 2022-01-02 ┆ 0   │
    │ Rain  ┆ 2022-01-02 ┆ 1   │
    │ Sun   ┆ 2022-01-03 ┆ 1   │
    │ Rain  ┆ 2022-01-03 ┆ 2   │
    │ Sun   ┆ 2022-01-05 ┆ 1   │
    │ Rain  ┆ 2022-01-05 ┆ 1   │
    │ Sun   ┆ 2022-01-08 ┆ 0   │
    └───────┴────────────┴─────┘
    

    We subtract one in the agg because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)