python-polars

Using a grouped z-score over a rolling window


I would like to calculate a z-score over a bin based on the data of a rolling look-back period.

Example: Today's visitor amount during [9:30-9:35) should be z-score normalized based on the (mean, std) of the last 3 days of visitors that visited during [9:30-9:35).

My current attempts both raise InvalidOperationError. Is there a way in polars to calculate this?

import pandas as pd
import polars as pl

def z_score(col: str, over: str, alias: str):
    # z-score normalize `col` using the mean/std computed per `over` group
    return (
        (pl.col(col) - pl.col(col).mean().over(over)) / pl.col(col).std().over(over)
    ).alias(alias)

df = pl.from_dict(
    {
        "timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
            pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
        ),
        "visitors": [(e % 2) + 1 for e in range(722)]
    }
# 5 minute bins for grouping [9:30-9:35) -> 930
).with_columns(
    pl.col("timestamp").dt.truncate(every="5m").dt.to_string("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_columns(
    pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)

# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score.


# not rolling, but also won't work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_columns(
#     z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )

# won't work either (InvalidOperationError: window expression not allowed in aggregation)
#df.rolling(index_column="daytrunc", period="3i").agg(z_score("visitors", "five_minute_bin", "normalized"))

As an example, take 4 days of data with four data points per day, split across two time bins: ({0,0}, {0,1}) and ({1,0}, {1,1}).

Input:

Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}

Output:

Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...
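
To make the intended arithmetic concrete, here is a small numpy sketch with made-up visitor counts for bin 0. It pools the bin-0 observations from days 0-3 exactly as in the formula above; whether the sample or population standard deviation is meant is an assumption here (shown with ddof=1).

import numpy as np

# made-up visitor counts for time bin 0 on days 0-3 (two observations per day)
bin0 = {
    0: [1.0, 2.0],  # x_d0_{0,0}, x_d0_{0,1}
    1: [1.0, 2.0],  # x_d1_{0,0}, x_d1_{0,1}
    2: [2.0, 3.0],  # x_d2_{0,0}, x_d2_{0,1}
    3: [2.0, 4.0],  # x_d3_{0,0}, x_d3_{0,1}
}

# pool every bin-0 observation in the look-back window (days 0-3, as in the example)
history = np.array(bin0[0] + bin0[1] + bin0[2] + bin0[3])

# z-score for today's first bin-0 observation, x_d3_{0,0}
norm_x_d3_00 = (bin0[3][0] - history.mean()) / history.std(ddof=1)
print(norm_x_d3_00)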

Solution

  • The key here is to use over to restrict your calculations to the five-minute bins, and then use the rolling functions to get the rolling mean and standard deviation over days, restricted by those five-minute bin keys. five_minute_bin works as in your code, and I believe a truncated day_bin is necessary so that, for example, 9:33 on one day is grouped together with both 9:31 from the same day and 9:31 from 2 days ago.

    from datetime import datetime

    import polars as pl

    days = 5
    pl.DataFrame(
        {
            "timestamp": pl.concat(
                pl.datetime_range(
                    datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s", eager=True
                )
                for d in range(2, days + 2)
            ),
            "visitors": [(e % 2) + 1 for e in range(days * 361)],
        }
    ).with_columns(
        five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.to_string("%H%M"),
        day_bin=pl.col("timestamp").dt.truncate(every="1d"),
    ).with_columns(
        standardized_visitors=(
            (
                pl.col("visitors")
                - pl.col("visitors").rolling_mean_by("day_bin", window_size="3d", closed="right")
            )
            / pl.col("visitors").rolling_std_by("day_bin", window_size="3d", closed="right")
        ).over("five_minute_bin")
    )
    

    Now, that said, when trying out the code for this, I found that polars doesn't handle non-unique values in the by-column of the rolling functions correctly, so the same values in the same 5-minute bin don't end up with the same standardized values. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't matter much, unless your data systematically differs in distribution within the 5-minute bins. A possible workaround is sketched below.
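
    In case that issue bites on your polars version, one way around it is to pre-aggregate per (five_minute_bin, day_bin) so the rolling by-column contains only unique days within each bin, reconstruct the rolling mean/std from rolling sums, and join the statistics back onto the raw rows. This is only a sketch, assuming a recent polars with rolling_sum_by and pl.len; df stands for the frame built above, and the intermediate column names (v_sum, roll_mean, etc.) are made up for illustration.

    import polars as pl

    # `df` is assumed to be the DataFrame built above, with
    # "visitors", "five_minute_bin" and "day_bin" columns.
    stats = (
        df.group_by("five_minute_bin", "day_bin")
        .agg(
            pl.col("visitors").sum().alias("v_sum"),
            (pl.col("visitors") ** 2).sum().alias("v_sq_sum"),
            pl.len().cast(pl.Int64).alias("v_cnt"),
        )
        .sort("day_bin")
        .with_columns(
            # rolling sums over the last 3 days, separately per 5-minute bin;
            # day_bin is unique within each bin here, sidestepping the issue above
            roll_sum=pl.col("v_sum").rolling_sum_by("day_bin", window_size="3d").over("five_minute_bin"),
            roll_sq_sum=pl.col("v_sq_sum").rolling_sum_by("day_bin", window_size="3d").over("five_minute_bin"),
            roll_cnt=pl.col("v_cnt").rolling_sum_by("day_bin", window_size="3d").over("five_minute_bin"),
        )
        .with_columns(roll_mean=pl.col("roll_sum") / pl.col("roll_cnt"))
        .with_columns(
            # sample standard deviation reconstructed from the rolling sums (ddof=1)
            roll_std=(
                (pl.col("roll_sq_sum") - pl.col("roll_cnt") * pl.col("roll_mean") ** 2)
                / (pl.col("roll_cnt") - 1)
            ).sqrt()
        )
    )

    # join the rolling statistics back and standardize the raw rows
    result = df.join(
        stats.select("five_minute_bin", "day_bin", "roll_mean", "roll_std"),
        on=["five_minute_bin", "day_bin"],
    ).with_columns(
        standardized_visitors=(pl.col("visitors") - pl.col("roll_mean")) / pl.col("roll_std")
    )

    The trade-off is one extra group_by and a join instead of a per-row rolling call, but the rolling statistics are then identical for every row sharing the same (five_minute_bin, day_bin), which is exactly the behaviour the bug report is about.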