I have a dataset of predictions vs. actuals over different time periods/segments. I would like to evaluate the predictions per segment using a Polars group_by, but couldn't figure out the best way to do it. In pandas, it would look like:
df.groupby("seg").apply(lambda x: eval_func(x["true"], x["pred"]))
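You can use pl.Expr.map_elements on a column having multiple fields defined using pl.struct. In the example below, I compute the mean absolute error between a time series and the corresponding forecast. (The function is named mape, but it does not divide by the actuals, so strictly speaking it is not a percentage error.)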
import polars as pl
df = pl.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "actuals": [1, 2, 3, 1, 2, 3],
    "preds": [1, 2, 3, 2, 3, 4],
})
shape: (6, 3)
┌───────┬─────────┬───────┐
│ group ┆ actuals ┆ preds │
│ ---   ┆ ---     ┆ ---   │
│ str   ┆ i64     ┆ i64   │
╞═══════╪═════════╪═══════╡
│ A     ┆ 1       ┆ 1     │
│ A     ┆ 2       ┆ 2     │
│ A     ┆ 3       ┆ 3     │
│ B     ┆ 1       ┆ 2     │
│ B     ┆ 2       ┆ 3     │
│ B     ┆ 3       ┆ 4     │
└───────┴─────────┴───────┘
def mape(x):
    # x is a struct Series; pull each field out as its own Series
    actuals, preds = x.struct.field("actuals"), x.struct.field("preds")
    return (actuals - preds).abs().mean()
df.group_by("group").agg(
    pl.struct(pl.col("actuals"), pl.col("preds")).map_elements(mape).alias("mape")
)
shape: (2, 2)
┌───────┬──────┐
│ group ┆ mape │
│ ---   ┆ ---  │
│ str   ┆ f64  │
╞═══════╪══════╡
│ A     ┆ 0.0  │
│ B     ┆ 1.0  │
└───────┴──────┘
Note that in this specific example, the result could've been computed relying purely on polars' expression API, which would be more efficient and idiomatic.