I'd like to ask around if anyone knows how to do rolling indexing in polars? I have personally tried a few solutions which did not work for me (I'll show them below):
What I'd like to do: Indexing the number of occurrences within the past X days by Name Example: Let's say I'd like to index occurrences within the past 2 days:
df = pl.from_repr("""
┌─────────┬─────────────────────┬─────────┐
│ Name ┆ Date ┆ Counter │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[ns] ┆ i64 │
╞═════════╪═════════════════════╪═════════╡
│ John ┆ 2023-01-01 00:00:00 ┆ 1 │
│ John ┆ 2023-01-01 00:00:00 ┆ 2 │
│ John ┆ 2023-01-01 00:00:00 ┆ 3 │
│ John ┆ 2023-01-01 00:00:00 ┆ 4 │
│ John ┆ 2023-01-02 00:00:00 ┆ 5 │
│ John ┆ 2023-01-02 00:00:00 ┆ 6 │
│ John ┆ 2023-01-02 00:00:00 ┆ 7 │
│ John ┆ 2023-01-02 00:00:00 ┆ 8 │
│ John ┆ 2023-01-03 00:00:00 ┆ 5 │
│ John ┆ 2023-01-03 00:00:00 ┆ 6 │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ 1 │
└─────────┴─────────────────────┴─────────┘
""")
In this case, the counter resets to "1" starting from the past X days (e.g. for 3 Jan 23, it starts "1" from 2 Jan 23), or if a new name is detected
What I've tried:
(df.rolling(index_column='Date', period='2d', group_by='Name')
.agg((pl.col("Date").rank(method='ordinal')).alias("Counter"))
)
The above does not work because it outputs:
┌─────────┬─────────────────────┬──────────────────────────┐
│ Name ┆ Date ┆ Counter │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[ns] ┆ list[u32] │
╞═════════╪═════════════════════╪══════════════════════════╡
│ John ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4] │
│ John ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4] │
│ John ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4] │
│ John ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4] │
│ John ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John ┆ 2023-01-03 00:00:00 ┆ [1, 2, 3, 4, 5, 6] │
│ John ┆ 2023-01-03 00:00:00 ┆ [1, 2, 3, 4, 5, 6] │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ [1] │
└─────────┴─────────────────────┴──────────────────────────┘
(df.with_columns(Mask=1)
.with_columns(Counter=pl.col("Mask").rolling_sum_by(window_size='2d', by="Date"))
)
But it outputs:
┌─────────┬─────────────────────┬─────────┬──────┐
│ Name ┆ Date ┆ Counter ┆ mask │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[ns] ┆ i32 ┆ i32 │
╞═════════╪═════════════════════╪═════════╪══════╡
│ John ┆ 2023-01-01 00:00:00 ┆ 5 ┆ 1 │
│ John ┆ 2023-01-01 00:00:00 ┆ 5 ┆ 1 │
│ John ┆ 2023-01-01 00:00:00 ┆ 5 ┆ 1 │
│ John ┆ 2023-01-01 00:00:00 ┆ 5 ┆ 1 │
│ John ┆ 2023-01-02 00:00:00 ┆ 9 ┆ 1 │
│ John ┆ 2023-01-02 00:00:00 ┆ 9 ┆ 1 │
│ John ┆ 2023-01-02 00:00:00 ┆ 9 ┆ 1 │
│ John ┆ 2023-01-02 00:00:00 ┆ 9 ┆ 1 │
│ John ┆ 2023-01-03 00:00:00 ┆ 6 ┆ 1 │
│ John ┆ 2023-01-03 00:00:00 ┆ 6 ┆ 1 │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ 5 ┆ 1 │
└─────────┴─────────────────────┴─────────┴──────┘
And it also cannot handle "New Guy" correctly because rolling_sum cannot do group_by=["Name", "Date"]
df.with_columns(Counter = pl.col("Date").rank(method='ordinal').over("Name", "Date") )
The above code works correctly, but can only be used for indexing within the same day (i.e. period="1d")
I also did this in Excel, and also using a brute/raw method of using a "for"-loop. Both worked perfectly, however they struggled with huge amounts of data.
What I read: Some references to help in answers: (Most didn't work because they have fixed rolling window instead of a dynamic window by "Date")
You could start with the approach giving the maximum count for each group (using pl.len()
within the aggregation) and post-process the Counter
column to make it's values increase within each group.
(
df
.rolling(index_column="Date", period="2d", group_by="Name")
.agg(
pl.len().alias("Counter")
)
.with_columns(
(pl.col("Counter") - pl.len() + 1 + pl.int_range(pl.len())).over("Name", "Date")
)
)
shape: (11, 3)
┌─────────┬────────────┬─────────┐
│ Name ┆ Date ┆ Counter │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 │
╞═════════╪════════════╪═════════╡
│ John ┆ 2023-01-01 ┆ 1 │
│ John ┆ 2023-01-01 ┆ 2 │
│ John ┆ 2023-01-01 ┆ 3 │
│ John ┆ 2023-01-01 ┆ 4 │
│ John ┆ 2023-01-02 ┆ 5 │
│ John ┆ 2023-01-02 ┆ 6 │
│ John ┆ 2023-01-02 ┆ 7 │
│ John ┆ 2023-01-02 ┆ 8 │
│ John ┆ 2023-01-03 ┆ 5 │
│ John ┆ 2023-01-03 ┆ 6 │
│ New Guy ┆ 2023-01-01 ┆ 1 │
└─────────┴────────────┴─────────┘
Explanation. After the aggregation, the Counter
column will take a constant value V within each name-date-group of length L. The objective is to make Counter
take the values V-L+1 to V (one value for each row) instead.
Therefore, we can
Counter
and