I have a data frame of the type:
df = pl.LazyFrame({"day": [1,2,4,5,2,3,5,6], 'type': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], "value": [1, 0, 3, 4, 2, 2, 0, 1]})
day type value
i64 str i64
1 "a" 1
2 "a" 0
4 "a" 3
5 "a" 4
2 "b" 2
3 "b" 2
5 "b" 0
6 "b" 1
I am trying to create a rolling sum variable, summing, for each different "type", the values in a two days window. Ideally, the resulting dataset would be the following:
day | type | value | rolling_sum |
---|---|---|---|
1 | a | 1 | 1 |
2 | a | 0 | 1 |
4 | a | 3 | 3 |
5 | a | 4 | 7 |
2 | b | 2 | 2 |
3 | b | 2 | 4 |
5 | b | 0 | 0 |
6 | b | 1 | 1 |
I tried using the following code:
df = df.with_columns(pl.col("value")
.rolling(index_column="day", by="type", period="2i")
.sum().alias("rolling_sum"))
but I get the error: "TypeError: rolling() got an unexpected keyword argument 'by'".
Could you help me fix it?
That's because in your code you're trying to use Expr.rolling()
which doesn't have by
parameter (strangely, it is mentioned in the documentation under check_sorted
parameter - is it just not implemented yet?), instead of DataFrame.rolling()
.
If you'd restructure the code to use the latter then it works fine:
(
df.rolling(
index_column="day", by="type", period="2i"
)
.agg(
pl.col('value').sum().alias("rolling_sum")
)
)
┌──────┬─────┬─────────────┐
│ type ┆ day ┆ rolling_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪═════╪═════════════╡
│ a ┆ 1 ┆ 1 │
│ a ┆ 2 ┆ 1 │
│ a ┆ 4 ┆ 3 │
│ a ┆ 5 ┆ 7 │
│ b ┆ 2 ┆ 2 │
│ b ┆ 3 ┆ 4 │
│ b ┆ 5 ┆ 0 │
│ b ┆ 6 ┆ 1 │
└──────┴─────┴─────────────┘
If you need to have value
column in your result, you can use Expr.rolling_sum()
combined with Expr.over()
instead (assuming your DataFrame is sorted by day
already):
df.with_columns(
pl.col("value")
.rolling_sum(window_size=2,min_periods=0)
.over("type")
.alias('rolling_sum')
)
┌─────┬──────┬───────┬─────────────┐
│ day ┆ type ┆ value ┆ rolling_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 │
╞═════╪══════╪═══════╪═════════════╡
│ 1 ┆ a ┆ 1 ┆ 1 │
│ 2 ┆ a ┆ 0 ┆ 1 │
│ 4 ┆ a ┆ 3 ┆ 3 │
│ 5 ┆ a ┆ 4 ┆ 7 │
│ 2 ┆ b ┆ 2 ┆ 2 │
│ 3 ┆ b ┆ 2 ┆ 4 │
│ 5 ┆ b ┆ 0 ┆ 2 │
│ 6 ┆ b ┆ 1 ┆ 1 │
└─────┴──────┴───────┴─────────────┘
Ideally, I would probably expect Expr.rolling
together with Expr.over
to work:
# something like this
df.with_columns(
pl.col("value")
.rolling(index_column="day", period="2i")
.sum()
.over("type")
.alias('rolling_sum')
)
# or this
df.set_sorted(['type','day']).with_columns(
pl.col("value")
.sum()
.over('type')
.rolling(index_column="day", period="2i")
.alias('rolling_sum')
)
but unfortunately, it doesn't:
InvalidOperationError: rolling expression not allowed in aggregation
Update
Using rolling_sum()
might not be something you want, if you plan your window to be based on days / weeks etc.
In this case you can still use DataFrame.rolling()
and combine it with Expr.last()
inside of GroupBy.agg()
to get the last value in the window:
(
df.rolling(
index_column="day", by="type", period="2i"
)
.agg(
pl.col('value').last(),
pl.col('value').sum().alias("rolling_sum")
)
)
┌──────┬─────┬───────┬─────────────┐
│ type ┆ day ┆ value ┆ rolling_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═════╪═══════╪═════════════╡
│ a ┆ 1 ┆ 1 ┆ 1 │
│ a ┆ 2 ┆ 0 ┆ 1 │
│ a ┆ 4 ┆ 3 ┆ 3 │
│ a ┆ 5 ┆ 4 ┆ 7 │
│ b ┆ 2 ┆ 2 ┆ 2 │
│ b ┆ 3 ┆ 2 ┆ 4 │
│ b ┆ 5 ┆ 0 ┆ 0 │
│ b ┆ 6 ┆ 1 ┆ 1 │
└──────┴─────┴───────┴─────────────┘