polars rolling by option not allowed?

I have a data frame of the type:

df = pl.LazyFrame({"day": [1,2,4,5,2,3,5,6], 'type': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], "value": [1, 0, 3, 4, 2, 2, 0, 1]})


day type    value
i64 str i64
1   "a" 1
2   "a" 0
4   "a" 3
5   "a" 4
2   "b" 2
3   "b" 2
5   "b" 0
6   "b" 1

I am trying to create a rolling sum variable, summing, for each different "type", the values in a two days window. Ideally, the resulting dataset would be the following:

day	type	value	rolling_sum
1	a	1	1
2	a	0	1
4	a	3	3
5	a	4	7
2	b	2	2
3	b	2	4
5	b	0	0
6	b	1	1

I tried using the following code:

df = df.with_columns(pl.col("value")
                     .rolling(index_column="day", by="type", period="2i")
                     .sum().alias("rolling_sum"))

but I get the error: "TypeError: rolling() got an unexpected keyword argument 'by'".

Could you help me fix it?

Solution

That's because in your code you're trying to use Expr.rolling() which doesn't have by parameter (strangely, it is mentioned in the documentation under check_sorted parameter - is it just not implemented yet?), instead of DataFrame.rolling().

If you'd restructure the code to use the latter then it works fine:

(
    df.rolling(
        index_column="day", by="type", period="2i"
    )
    .agg(
        pl.col('value').sum().alias("rolling_sum")
    )
)

┌──────┬─────┬─────────────┐
│ type ┆ day ┆ rolling_sum │
│ ---  ┆ --- ┆ ---         │
│ str  ┆ i64 ┆ i64         │
╞══════╪═════╪═════════════╡
│ a    ┆ 1   ┆ 1           │
│ a    ┆ 2   ┆ 1           │
│ a    ┆ 4   ┆ 3           │
│ a    ┆ 5   ┆ 7           │
│ b    ┆ 2   ┆ 2           │
│ b    ┆ 3   ┆ 4           │
│ b    ┆ 5   ┆ 0           │
│ b    ┆ 6   ┆ 1           │
└──────┴─────┴─────────────┘

If you need to have value column in your result, you can use Expr.rolling_sum() combined with Expr.over() instead (assuming your DataFrame is sorted by day already):

df.with_columns(
    pl.col("value")
    .rolling_sum(window_size=2,min_periods=0)
    .over("type")
    .alias('rolling_sum')
)

┌─────┬──────┬───────┬─────────────┐
│ day ┆ type ┆ value ┆ rolling_sum │
│ --- ┆ ---  ┆ ---   ┆ ---         │
│ i64 ┆ str  ┆ i64   ┆ i64         │
╞═════╪══════╪═══════╪═════════════╡
│ 1   ┆ a    ┆ 1     ┆ 1           │
│ 2   ┆ a    ┆ 0     ┆ 1           │
│ 4   ┆ a    ┆ 3     ┆ 3           │
│ 5   ┆ a    ┆ 4     ┆ 7           │
│ 2   ┆ b    ┆ 2     ┆ 2           │
│ 3   ┆ b    ┆ 2     ┆ 4           │
│ 5   ┆ b    ┆ 0     ┆ 2           │
│ 6   ┆ b    ┆ 1     ┆ 1           │
└─────┴──────┴───────┴─────────────┘

Ideally, I would probably expect Expr.rolling together with Expr.over to work:

# something like this
df.with_columns(
    pl.col("value")
    .rolling(index_column="day", period="2i")
    .sum()
    .over("type")
    .alias('rolling_sum')
)

# or this
df.set_sorted(['type','day']).with_columns(
    pl.col("value")
    .sum()
    .over('type')
    .rolling(index_column="day", period="2i")
    .alias('rolling_sum')
)

but unfortunately, it doesn't:

InvalidOperationError: rolling expression not allowed in aggregation

Update

Using rolling_sum() might not be something you want, if you plan your window to be based on days / weeks etc. In this case you can still use DataFrame.rolling() and combine it with Expr.last() inside of GroupBy.agg() to get the last value in the window:

(
    df.rolling(
        index_column="day", by="type", period="2i"
    )
    .agg(
        pl.col('value').last(),
        pl.col('value').sum().alias("rolling_sum")
    )
)

┌──────┬─────┬───────┬─────────────┐
│ type ┆ day ┆ value ┆ rolling_sum │
│ ---  ┆ --- ┆ ---   ┆ ---         │
│ str  ┆ i64 ┆ i64   ┆ i64         │
╞══════╪═════╪═══════╪═════════════╡
│ a    ┆ 1   ┆ 1     ┆ 1           │
│ a    ┆ 2   ┆ 0     ┆ 1           │
│ a    ┆ 4   ┆ 3     ┆ 3           │
│ a    ┆ 5   ┆ 4     ┆ 7           │
│ b    ┆ 2   ┆ 2     ┆ 2           │
│ b    ┆ 3   ┆ 2     ┆ 4           │
│ b    ┆ 5   ┆ 0     ┆ 0           │
│ b    ┆ 6   ┆ 1     ┆ 1           │
└──────┴─────┴───────┴─────────────┘