polars groupby custom function on multiple columns


I have a dataset of predictions vs. actuals over different time periods/segments. I would like to evaluate the predictions per segment using a polars group-by, but I couldn't figure out the best way to do it. In pandas, it would look like:

df.groupby("seg").apply(lambda x: eval_func(x["true"], x["pred"]))

Solution

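  • You can use pl.Expr.map_elements on a struct column that combines several fields via pl.struct. Inside group_by(...).agg(...), the function is called once per group with a struct Series, from which the individual fields can be extracted. The example below computes, for each group, the mean absolute error between a time series and the corresponding forecast (the function is named mape, but it does not divide by the actuals, so it is not a percentage error).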

    import polars as pl
    
    df = pl.DataFrame({
        "group": ["A"] * 3 + ["B"] * 3,
        "actuals": [1, 2, 3, 1, 2, 3],
        "preds": [1, 2, 3, 2, 3, 4],
    })
    
    shape: (6, 3)
    ┌───────┬─────────┬───────┐
    │ group ┆ actuals ┆ preds │
    │ ---   ┆ ---     ┆ ---   │
    │ str   ┆ i64     ┆ i64   │
    ╞═══════╪═════════╪═══════╡
    │ A     ┆ 1       ┆ 1     │
    │ A     ┆ 2       ┆ 2     │
    │ A     ┆ 3       ┆ 3     │
    │ B     ┆ 1       ┆ 2     │
    │ B     ┆ 2       ┆ 3     │
    │ B     ┆ 3       ┆ 4     │
    └───────┴─────────┴───────┘
    
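    def mape(x):
        # x is a struct Series holding one group's rows; unpack both fields.
        actuals, preds = x.struct.field("actuals"), x.struct.field("preds")
        # Mean of the absolute differences between actuals and predictions.
        return (actuals - preds).abs().mean()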
    
    df.group_by("group").agg(
        pl.struct(pl.col("actuals"), pl.col("preds")).map_elements(mape).alias("mape")
    )
    
    shape: (2, 2)
    ┌───────┬──────┐
    │ group ┆ mape │
    │ ---   ┆ ---  │
    │ str   ┆ f64  │
    ╞═══════╪══════╡
    │ A     ┆ 0.0  │
    │ B     ┆ 1.0  │
    └───────┴──────┘
    

    Note that in this specific example, the result could have been computed purely with polars' native expression API, which is more efficient and idiomatic because it avoids calling back into Python for each group.