polars groupby custom function on multiple columns

I have a dataset of predictions vs. actuals over different time periods/segments. I would like to evaluate the predictions per segment using a polars group-by, but I couldn't figure out the best way to do it. In pandas, it would look like:

df.groupby("seg").apply(lambda x: eval_func(x["true"], x["pred"]))

Solution

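You can use pl.Expr.map_elements on a struct column that bundles several columns, built with pl.struct. In the example below, I compute an error metric between a time series and the corresponding forecast. The function is named mape, but as written it returns the mean absolute difference between actuals and predictions; it does not divide by the actuals.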

import polars as pl

df = pl.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "actuals": [1, 2, 3, 1, 2, 3],
    "preds": [1, 2, 3, 2, 3, 4],
})

shape: (6, 3)
┌───────┬─────────┬───────┐
│ group ┆ actuals ┆ preds │
│ ---   ┆ ---     ┆ ---   │
│ str   ┆ i64     ┆ i64   │
╞═══════╪═════════╪═══════╡
│ A     ┆ 1       ┆ 1     │
│ A     ┆ 2       ┆ 2     │
│ A     ┆ 3       ┆ 3     │
│ B     ┆ 1       ┆ 2     │
│ B     ┆ 2       ┆ 3     │
│ B     ┆ 3       ┆ 4     │
└───────┴─────────┴───────┘

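def mape(x):
    # x is a Series of dtype Struct holding the rows of one group;
    # unpack its two fields back into separate Series.
    actuals, preds = x.struct.field("actuals"), x.struct.field("preds")
    # Mean absolute difference (no division by actuals is performed here).
    return (actuals - preds).abs().mean()

df.group_by("group").agg(
    # Bundle both columns into one struct column so the function receives them together.
    pl.struct(pl.col("actuals"), pl.col("preds")).map_elements(mape).alias("mape")
)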

shape: (2, 2)
┌───────┬──────┐
│ group ┆ mape │
│ ---   ┆ ---  │
│ str   ┆ f64  │
╞═══════╪══════╡
│ A     ┆ 0.0  │
│ B     ┆ 1.0  │
└───────┴──────┘

Note that in this specific example, the result could have been computed purely with polars' native expression API, which is more efficient and idiomatic because it avoids calling a Python function for each group.