Search code examples
pythonpandasgroup-bypython-polarsdata-wrangling

Python - Rolling Indexing in Polars library?


I'd like to ask around if anyone knows how to do rolling indexing in polars? I have personally tried a few solutions which did not work for me (I'll show them below):

What I'd like to do: Indexing the number of occurrences within the past X days by Name Example: Let's say I'd like to index occurrences within the past 2 days:

Name Date Counter
John 1 Jan 23 1
John 1 Jan 23 2
John 1 Jan 23 3
John 1 Jan 23 4
John 2 Jan 23 5
John 2 Jan 23 6
John 2 Jan 23 7
John 2 Jan 23 8
John 3 Jan 23 5
John 3 Jan 23 6
New Guy 1 Jan 23 1

In this case, the counter resets to "1" starting from the past X days (e.g. for 3 Jan 23, it starts "1" from 2 Jan 23), or if a new name is detected

What I've tried:

df.groupby_rolling(index_column='Date', period='2d', by='Name', check_sorted=False).agg((pl.col("Date").rank(method='ordinal')).alias("Counter"))

The above does not work because it outputs:

Name Date Counter
John 1 Jan 23 1,2,3,4
John 1 Jan 23 1,2,3,4
John 1 Jan 23 1,2,3,4
John 1 Jan 23 1,2,3,4
John 2 Jan 23 1...8
John 2 Jan 23 1...8
John 2 Jan 23 1...8
John 2 Jan 23 1...8
John 3 Jan 23 1...6
John 3 Jan 23 1...6
New Guy 1 Jan 23 1
df.with_columns( Counter=pl.col("mask").rolling_sum(window_size='2d', by="Date") )

Where I made a column "Mask" which is just a column of "1"s, and tried to sum them, but it outputs:

Name Date Mask Counter
John 1 Jan 23 1 4
John 1 Jan 23 1 4
John 1 Jan 23 1 4
John 1 Jan 23 1 4
John 2 Jan 23 1 8
John 2 Jan 23 1 8
John 2 Jan 23 1 8
John 2 Jan 23 1 8
John 3 Jan 23 1 6
John 3 Jan 23 1 6

And it also cannot handle "New Guy" correctly because rolling_sum cannot do by="Name", "Date"

df.with_columns(Counter = pl.col("Date").rank(method='ordinal').over(["Name", "Date"]) )

The above code works correctly, but can only be used for indexing within the same day (i.e. period="1d")

Additional Notes: I also did this in Excel, and also using a brute/raw method of using a "for"-loop. Both worked perfectly, however they struggled with huge amounts of data.

What I read: Some references to help in answers: (Most didn't work because they have fixed rolling window instead of a dynamic window by "Date")

How to implement rolling rank in Polars version 0.19

https://github.com/pola-rs/polars/issues/4808

How to do group_by_rolling grouped by day by hour in polars in Python?

How to groupby and rolling in polars?

https://docs.pola.rs/py-polars/html/reference/series/api/polars.Series.rank.html

https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html


Solution

  • You could start with the approach giving the maximum count for each group (using pl.len() within the aggregation) and post-process the Counter column to make it's values increase within each group.

    (
        df
        .rolling(index_column="Date", period="2d", group_by="Name", check_sorted=False)
        .agg(
            pl.len().alias("Counter")
        )
        .with_columns(
            (pl.col("Counter") - pl.len() + 1 + pl.int_range(pl.len())).over("Name", "Date")
        )
    )
    
    shape: (11, 3)
    ┌─────────┬────────────┬─────────┐
    │ Name    ┆ Date       ┆ Counter │
    │ ---     ┆ ---        ┆ ---     │
    │ str     ┆ date       ┆ i64     │
    ╞═════════╪════════════╪═════════╡
    │ John    ┆ 2023-01-01 ┆ 1       │
    │ John    ┆ 2023-01-01 ┆ 2       │
    │ John    ┆ 2023-01-01 ┆ 3       │
    │ John    ┆ 2023-01-01 ┆ 4       │
    │ John    ┆ 2023-01-02 ┆ 5       │
    │ John    ┆ 2023-01-02 ┆ 6       │
    │ John    ┆ 2023-01-02 ┆ 7       │
    │ John    ┆ 2023-01-02 ┆ 8       │
    │ John    ┆ 2023-01-03 ┆ 5       │
    │ John    ┆ 2023-01-03 ┆ 6       │
    │ New Guy ┆ 2023-01-01 ┆ 1       │
    └─────────┴────────────┴─────────┘
    

    Explanation. After the aggregation, the Counter column will take a constant value V within each name-date-group of length L. The objective is to make Counter take the values V-L+1 to V (one value for each row) instead.

    Therefore, we can

    1. subtract L-1 from Counter and
    2. add an int range with increasing values from 0 to L-1.