Search code examples
pythongroup-bypython-polarsdata-wrangling

Python - Rolling Indexing in Polars library?


I'd like to ask around if anyone knows how to do rolling indexing in polars? I have personally tried a few solutions which did not work for me (I'll show them below):

What I'd like to do: Indexing the number of occurrences within the past X days by Name Example: Let's say I'd like to index occurrences within the past 2 days:

df = pl.from_repr("""
┌─────────┬─────────────────────┬─────────┐
│ Name    ┆ Date                ┆ Counter │
│ ---     ┆ ---                 ┆ ---     │
│ str     ┆ datetime[ns]        ┆ i64     │
╞═════════╪═════════════════════╪═════════╡
│ John    ┆ 2023-01-01 00:00:00 ┆ 1       │
│ John    ┆ 2023-01-01 00:00:00 ┆ 2       │
│ John    ┆ 2023-01-01 00:00:00 ┆ 3       │
│ John    ┆ 2023-01-01 00:00:00 ┆ 4       │
│ John    ┆ 2023-01-02 00:00:00 ┆ 5       │
│ John    ┆ 2023-01-02 00:00:00 ┆ 6       │
│ John    ┆ 2023-01-02 00:00:00 ┆ 7       │
│ John    ┆ 2023-01-02 00:00:00 ┆ 8       │
│ John    ┆ 2023-01-03 00:00:00 ┆ 5       │
│ John    ┆ 2023-01-03 00:00:00 ┆ 6       │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ 1       │
└─────────┴─────────────────────┴─────────┘
""")

In this case, the counter resets to "1" starting from the past X days (e.g. for 3 Jan 23, it starts "1" from 2 Jan 23), or if a new name is detected

What I've tried:

(df.rolling(index_column='Date', period='2d', group_by='Name') 
   .agg((pl.col("Date").rank(method='ordinal')).alias("Counter"))
)

The above does not work because it outputs:

┌─────────┬─────────────────────┬──────────────────────────┐
│ Name    ┆ Date                ┆ Counter                  │
│ ---     ┆ ---                 ┆ ---                      │
│ str     ┆ datetime[ns]        ┆ list[u32]                │
╞═════════╪═════════════════════╪══════════════════════════╡
│ John    ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4]             │
│ John    ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4]             │
│ John    ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4]             │
│ John    ┆ 2023-01-01 00:00:00 ┆ [1, 2, 3, 4]             │
│ John    ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John    ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John    ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John    ┆ 2023-01-02 00:00:00 ┆ [1, 2, 3, 4, 5, 6, 7, 8] │
│ John    ┆ 2023-01-03 00:00:00 ┆ [1, 2, 3, 4, 5, 6]       │
│ John    ┆ 2023-01-03 00:00:00 ┆ [1, 2, 3, 4, 5, 6]       │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ [1]                      │
└─────────┴─────────────────────┴──────────────────────────┘
(df.with_columns(Mask=1) 
   .with_columns(Counter=pl.col("Mask").rolling_sum_by(window_size='2d', by="Date"))
)

But it outputs:

┌─────────┬─────────────────────┬─────────┬──────┐
│ Name    ┆ Date                ┆ Counter ┆ mask │
│ ---     ┆ ---                 ┆ ---     ┆ ---  │
│ str     ┆ datetime[ns]        ┆ i32     ┆ i32  │
╞═════════╪═════════════════════╪═════════╪══════╡
│ John    ┆ 2023-01-01 00:00:00 ┆ 5       ┆ 1    │
│ John    ┆ 2023-01-01 00:00:00 ┆ 5       ┆ 1    │
│ John    ┆ 2023-01-01 00:00:00 ┆ 5       ┆ 1    │
│ John    ┆ 2023-01-01 00:00:00 ┆ 5       ┆ 1    │
│ John    ┆ 2023-01-02 00:00:00 ┆ 9       ┆ 1    │
│ John    ┆ 2023-01-02 00:00:00 ┆ 9       ┆ 1    │
│ John    ┆ 2023-01-02 00:00:00 ┆ 9       ┆ 1    │
│ John    ┆ 2023-01-02 00:00:00 ┆ 9       ┆ 1    │
│ John    ┆ 2023-01-03 00:00:00 ┆ 6       ┆ 1    │
│ John    ┆ 2023-01-03 00:00:00 ┆ 6       ┆ 1    │
│ New Guy ┆ 2023-01-01 00:00:00 ┆ 5       ┆ 1    │
└─────────┴─────────────────────┴─────────┴──────┘

And it also cannot handle "New Guy" correctly because rolling_sum cannot do group_by=["Name", "Date"]

df.with_columns(Counter = pl.col("Date").rank(method='ordinal').over("Name", "Date") )

The above code works correctly, but can only be used for indexing within the same day (i.e. period="1d")


Additional Notes

I also did this in Excel, and also using a brute/raw method of using a "for"-loop. Both worked perfectly, however they struggled with huge amounts of data.

What I read: Some references to help in answers: (Most didn't work because they have fixed rolling window instead of a dynamic window by "Date")


Solution

  • You could start with the approach giving the maximum count for each group (using pl.len() within the aggregation) and post-process the Counter column to make it's values increase within each group.

    (
        df
        .rolling(index_column="Date", period="2d", group_by="Name")
        .agg(
            pl.len().alias("Counter")
        )
        .with_columns(
            (pl.col("Counter") - pl.len() + 1 + pl.int_range(pl.len())).over("Name", "Date")
        )
    )
    
    shape: (11, 3)
    ┌─────────┬────────────┬─────────┐
    │ Name    ┆ Date       ┆ Counter │
    │ ---     ┆ ---        ┆ ---     │
    │ str     ┆ date       ┆ i64     │
    ╞═════════╪════════════╪═════════╡
    │ John    ┆ 2023-01-01 ┆ 1       │
    │ John    ┆ 2023-01-01 ┆ 2       │
    │ John    ┆ 2023-01-01 ┆ 3       │
    │ John    ┆ 2023-01-01 ┆ 4       │
    │ John    ┆ 2023-01-02 ┆ 5       │
    │ John    ┆ 2023-01-02 ┆ 6       │
    │ John    ┆ 2023-01-02 ┆ 7       │
    │ John    ┆ 2023-01-02 ┆ 8       │
    │ John    ┆ 2023-01-03 ┆ 5       │
    │ John    ┆ 2023-01-03 ┆ 6       │
    │ New Guy ┆ 2023-01-01 ┆ 1       │
    └─────────┴────────────┴─────────┘
    

    Explanation. After the aggregation, the Counter column will take a constant value V within each name-date-group of length L. The objective is to make Counter take the values V-L+1 to V (one value for each row) instead.

    Therefore, we can

    1. subtract L-1 from Counter and
    2. add an int range with increasing values from 0 to L-1.