I have a DataFrame like this:
import polars as pl
df = pl.DataFrame({"x": [1.2, 1.3, 3.4, 3.5]})
df
# shape: (3, 1)
# ┌─────┐
# │ a │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 1.2 │
# │ 1.3 │
# │ 3.4 │
# │ 3.5 │
# └─────┘
I would like to make a rolling aggregation using .rolling()
so that each row uses a window [-2:1]:
shape: (4, 2)
┌─────┬───────────────────┐
│ x ┆ y │
│ --- ┆ --- │
│ f64 ┆ list[f64] │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3] │
│ 1.3 ┆ [1.2, 1.3, 3.4] │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5] │
└─────┴───────────────────┘
So far, I managed to do this with the following code:
df.with_row_index("index").with_columns(
y = pl.col("x").rolling(index_column = "index", period = "4i", offset = "-3i")
).drop("index")
However this requires manually creating a column index
and then removing it after the operation. Is there a way to achieve the same result in a single with_columns()
call?
You can use concat_list
with shift
(
df
.with_columns(
y=pl.concat_list(
pl.col('x').shift(x)
for x in range(2,-2,-1)
)
.list.drop_nulls()
)
)
shape: (4, 2)
┌─────┬───────────────────┐
│ x ┆ y │
│ --- ┆ --- │
│ f64 ┆ list[f64] │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3] │
│ 1.3 ┆ [1.2, 1.3, 3.4] │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5] │
└─────┴───────────────────┘
There are a couple things to note here.
shift
is positive, that means to go backwards which is the opposite of your notation.range
can count backwards with (start, stop, increment) but stop
is non-inclusive so when entering that parameter, it needs an extra -1.concat_list
you need to manually drop the nulls that it will have for items at the beginning and end of the series.As always, you can wrap this into a function, including a translation of your preferred notation to what you actually need in range
for it to work.
from typing import Sequence
def my_roll(in_column: str | pl.Expr, window: Sequence):
if isinstance(in_column, str):
in_column = pl.col(in_column)
pl_window = range(-window[0], -window[1] - 1, -1)
return pl.concat_list(in_column.shift(x) for x in pl_window).list.drop_nulls()
which then allows you to do
df.with_columns(y=my_roll("x", [-2,1]))
If you don't care about static typing you can even monkey patch it to pl.Expr
like this pl.Expr.my_roll = my_roll
and then do df.with_columns(y=pl.col("x").my_roll([-2,1]))
but your pylance/pyright/mypy/etc will complain about it not existing.
You can combine the built in way featuring .with_row_index
and .rolling
into a .map_batches
that just turns your column into a df and spits back the series you care about.
def my_roll(in_column: str | pl.Expr, window):
if isinstance(in_column, str):
in_column = pl.col(in_column)
period = f"{window[1]-window[0]+1}i"
offset = f"{window[0]-1}i"
return in_column.map_batches(
lambda s: (
s.to_frame()
.with_row_index()
.select(
pl.col(s.name).rolling(
index_column="index",
period=period,
offset=offset
)
)
.get_column(s.name)
)
)
The way this works is that map_batches will turn your column into a Series and then run a function on it where the function returns another Series. If we make the function turn that Series into a DF, then attach the row_index, do the rolling, and get the resultant Series then that gives you exactly what you want all contained in an expression. It should be just as performant as the verbose way, assuming you don't have any other use of the row_index.
then you do
df.with_columns(y=my_roll("x", [-2,1]))