I'd like to ask around if anyone knows how to do rolling indexing in polars? I have personally tried a few solutions which did not work for me (I'll show them below):
What I'd like to do: Indexing the number of occurrences within the past X days by Name Example: Let's say I'd like to index occurrences within the past 2 days:
Name | Date | Counter |
---|---|---|
John | 1 Jan 23 | 1 |
John | 1 Jan 23 | 2 |
John | 1 Jan 23 | 3 |
John | 1 Jan 23 | 4 |
John | 2 Jan 23 | 5 |
John | 2 Jan 23 | 6 |
John | 2 Jan 23 | 7 |
John | 2 Jan 23 | 8 |
John | 3 Jan 23 | 5 |
John | 3 Jan 23 | 6 |
New Guy | 1 Jan 23 | 1 |
In this case, the counter resets to "1" starting from the past X days (e.g. for 3 Jan 23, it starts "1" from 2 Jan 23), or if a new name is detected
What I've tried:
df.groupby_rolling(index_column='Date', period='2d', by='Name', check_sorted=False).agg((pl.col("Date").rank(method='ordinal')).alias("Counter"))
The above does not work because it outputs:
Name | Date | Counter |
---|---|---|
John | 1 Jan 23 | 1,2,3,4 |
John | 1 Jan 23 | 1,2,3,4 |
John | 1 Jan 23 | 1,2,3,4 |
John | 1 Jan 23 | 1,2,3,4 |
John | 2 Jan 23 | 1...8 |
John | 2 Jan 23 | 1...8 |
John | 2 Jan 23 | 1...8 |
John | 2 Jan 23 | 1...8 |
John | 3 Jan 23 | 1...6 |
John | 3 Jan 23 | 1...6 |
New Guy | 1 Jan 23 | 1 |
df.with_columns( Counter=pl.col("mask").rolling_sum(window_size='2d', by="Date") )
Where I made a column "Mask" which is just a column of "1"s, and tried to sum them, but it outputs:
Name | Date | Mask | Counter |
---|---|---|---|
John | 1 Jan 23 | 1 | 4 |
John | 1 Jan 23 | 1 | 4 |
John | 1 Jan 23 | 1 | 4 |
John | 1 Jan 23 | 1 | 4 |
John | 2 Jan 23 | 1 | 8 |
John | 2 Jan 23 | 1 | 8 |
John | 2 Jan 23 | 1 | 8 |
John | 2 Jan 23 | 1 | 8 |
John | 3 Jan 23 | 1 | 6 |
John | 3 Jan 23 | 1 | 6 |
And it also cannot handle "New Guy" correctly because rolling_sum cannot do by="Name", "Date"
df.with_columns(Counter = pl.col("Date").rank(method='ordinal').over(["Name", "Date"]) )
The above code works correctly, but can only be used for indexing within the same day (i.e. period="1d")
Additional Notes: I also did this in Excel, and also using a brute/raw method of using a "for"-loop. Both worked perfectly, however they struggled with huge amounts of data.
What I read: Some references to help in answers: (Most didn't work because they have fixed rolling window instead of a dynamic window by "Date")
How to implement rolling rank in Polars version 0.19
https://github.com/pola-rs/polars/issues/4808
How to do group_by_rolling grouped by day by hour in polars in Python?
How to groupby and rolling in polars?
https://docs.pola.rs/py-polars/html/reference/series/api/polars.Series.rank.html
https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html
You could start with the approach giving the maximum count for each group (using pl.len()
within the aggregation) and post-process the Counter
column to make it's values increase within each group.
(
df
.rolling(index_column="Date", period="2d", group_by="Name", check_sorted=False)
.agg(
pl.len().alias("Counter")
)
.with_columns(
(pl.col("Counter") - pl.len() + 1 + pl.int_range(pl.len())).over("Name", "Date")
)
)
shape: (11, 3)
┌─────────┬────────────┬─────────┐
│ Name ┆ Date ┆ Counter │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 │
╞═════════╪════════════╪═════════╡
│ John ┆ 2023-01-01 ┆ 1 │
│ John ┆ 2023-01-01 ┆ 2 │
│ John ┆ 2023-01-01 ┆ 3 │
│ John ┆ 2023-01-01 ┆ 4 │
│ John ┆ 2023-01-02 ┆ 5 │
│ John ┆ 2023-01-02 ┆ 6 │
│ John ┆ 2023-01-02 ┆ 7 │
│ John ┆ 2023-01-02 ┆ 8 │
│ John ┆ 2023-01-03 ┆ 5 │
│ John ┆ 2023-01-03 ┆ 6 │
│ New Guy ┆ 2023-01-01 ┆ 1 │
└─────────┴────────────┴─────────┘
Explanation. After the aggregation, the Counter
column will take a constant value V within each name-date-group of length L. The objective is to make Counter
take the values V-L+1 to V (one value for each row) instead.
Therefore, we can
Counter
and