Say I have:
data = {
'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
'd': [1,2,3,0,1,2,3],
'sales': [5,1,3,4,1,2,3],
}
I would like to add a column with a rolling mean with window size 2, with min_periods=2
, over 'id'
In Polars, I can do:
import polars as pl
df = pl.DataFrame(data)
df.with_columns(sales_rolling = pl.col('sales').rolling_mean(2).over('id'))
shape: (7, 4)
┌─────┬─────┬───────┬───────────────┐
│ id ┆ d ┆ sales ┆ sales_rolling │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪═══════╪═══════════════╡
│ a ┆ 1 ┆ 5 ┆ null │
│ a ┆ 2 ┆ 1 ┆ 3.0 │
│ a ┆ 3 ┆ 3 ┆ 2.0 │
│ b ┆ 0 ┆ 4 ┆ null │
│ b ┆ 1 ┆ 1 ┆ 2.5 │
│ b ┆ 2 ┆ 2 ┆ 1.5 │
│ b ┆ 3 ┆ 3 ┆ 2.5 │
└─────┴─────┴───────┴───────────────┘
What's the DuckDB equivalent? I've tried
import duckdb
duckdb.sql("""
select
*,
mean(sales) over (
partition by id
order by d
range between 1 preceding and 0 following
) as sales_rolling
from df
""").sort('id', 'd')
but get
┌─────────┬───────┬───────┬───────────────┐
│ id │ d │ sales │ sales_rolling │
│ varchar │ int64 │ int64 │ double │
├─────────┼───────┼───────┼───────────────┤
│ a │ 1 │ 5 │ 5.0 │
│ a │ 2 │ 1 │ 3.0 │
│ a │ 3 │ 3 │ 2.0 │
│ b │ 0 │ 4 │ 4.0 │
│ b │ 1 │ 1 │ 2.5 │
│ b │ 2 │ 2 │ 1.5 │
│ b │ 3 │ 3 │ 2.5 │
└─────────┴───────┴───────┴───────────────┘
This is very close, but duckdb still calculates the rolling mean when there's only a single value in the window. How can I replicate the min_periods=2
(default) behaviour from Polars?
You can use case
statement and count
:
duckdb.sql("""
from df
select
*,
case
when count(*) over rolling2 = 2 then
mean(sales) over rolling2
end as sales_rolling
window rolling2 as (
partition by id
order by d
rows between 1 preceding and current row
)
""").sort('id', 'd')
┌─────────┬───────┬───────┬───────────────┐
│ id │ d │ sales │ sales_rolling │
│ varchar │ int64 │ int64 │ double │
├─────────┼───────┼───────┼───────────────┤
│ a │ 1 │ 5 │ NULL │
│ a │ 2 │ 1 │ 3.0 │
│ a │ 3 │ 3 │ 2.0 │
│ b │ 0 │ 4 │ NULL │
│ b │ 1 │ 1 │ 2.5 │
│ b │ 2 │ 2 │ 1.5 │
│ b │ 3 │ 3 │ 2.5 │
└─────────┴───────┴───────┴───────────────┘
Note I've used named window and row framing here.