Search code examples
pythonpython-polars

How to translate a nested groupby apply aggregation from pandas in to polars?


Im transitioning from pandas, so please excuse my non-parallelized brain. Suppose we have following pandas code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    val: np.random.randint(1,5,100) for val in ['a','b','c','d','x','y','z']
})    

df.groupby('a').apply(lambda df:
    df.sort_values('c')
      .groupby('d')
      [['x','y','z']]
      .agg(['max','mean','median'])
)

Output (with smooshed multiindex to paste it here):

a d 0 1 2 3 4 5 6 7 8
x x x y y y z z z
sum mean median sum mean median sum mean median
1 1 15.0 3.75 4.0 12.0 3.0 3.5 12.0 3.0 3.0
1 2 9.0 3.0 3.0 5.0 1.666667 1.0 9.0 3.0 4.0
1 3 33.0 3.0 3.0 30.0 2.727273 3.0 27.0 2.454545 2.0
1 4 23.0 2.8750 3.0 16.0 2.0 2.0 15.0 1.8750 1.0
2 1 18.0 2.571429 2.0 13.0 1.857143 2.0 18.0 2.571429 3.0
2 2 18.0 2.0 1.0 23.0 2.555556 2.0 25.0 2.777778 3.0
2 3 11.0 3.666667 4.0 9.0 3.0 3.0 9.0 3.0 4.0
2 4 3.0 1.50 1.50 6.0 3.0 3.0 4.0 2.0 2.0
3 1 28.0 2.80 3.0 21.0 2.10 2.0 29.0 2.90 3.0
3 2 13.0 2.166667 2.0 19.0 3.166667 3.0 18.0 3.0 3.0
3 3 16.0 1.777778 2.0 22.0 2.444444 3.0 32.0 3.555556 4.0
3 4 20.0 2.222222 2.0 23.0 2.555556 2.0 23.0 2.555556 3.0
4 1 9.0 2.250 2.0 10.0 2.50 2.50 5.0 1.250 1.0
4 2 19.0 3.166667 3.0 8.0 1.333333 1.0 22.0 3.666667 4.0
4 3 10.0 2.0 1.0 14.0 2.80 3.0 15.0 3.0 3.0
4 4 9.0 2.250 2.0 12.0 3.0 3.0 10.0 2.50 2.50

How to rewrite it in polars?

The core idea of the exercise is that in apply i can do something with the whole dataframe group, e.g. sort it and then aggregate (which doesnt make sense, i know, but the idea is freedom to do whatever). Do i lose this freedom if i want my code to be parallelizable or is there a way to catch the whole group? I tried pl.all() but couldnt figure out the trick to at least sort each sub-df


Solution

  • It's unclear why you have a groupby inside apply but the same output is generated from:

    df.groupby(['a', 'd'])[['x', 'y', 'z']].agg(['max', 'mean', 'median'])
    

    In which case it is a simple translation.

    Polars does not have a MultiIndex (or Index) so we add a suffix to the column names instead.

    df = pl.from_pandas(df) # convert your example to Polars
    
    df.group_by("a", "d").agg(
        pl.col("x", "y", "z").max().name.suffix("_max"),
        pl.col("x", "y", "z").mean().name.suffix("_mean"),
        pl.col("x", "y", "z").median().name.suffix("_median")
    )
    
    shape: (16, 11)
    ┌─────┬─────┬───────┬───────┬───┬──────────┬──────────┬──────────┬──────────┐
    │ a   ┆ b   ┆ x_max ┆ y_max ┆ … ┆ z_mean   ┆ x_median ┆ y_median ┆ z_median │
    │ --- ┆ --- ┆ ---   ┆ ---   ┆   ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
    │ i64 ┆ i64 ┆ i64   ┆ i64   ┆   ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
    ╞═════╪═════╪═══════╪═══════╪═══╪══════════╪══════════╪══════════╪══════════╡
    │ 3   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.571429 ┆ 3.0      ┆ 2.0      ┆ 3.0      │
    │ 4   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.625    ┆ 3.5      ┆ 3.0      ┆ 3.0      │
    │ 2   ┆ 4   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 4.0      │
    │ 1   ┆ 4   ┆ 3     ┆ 3     ┆ … ┆ 1.6      ┆ 2.0      ┆ 3.0      ┆ 2.0      │
    │ 3   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.6      ┆ 3.0      ┆ 2.0      ┆ 3.0      │
    │ …   ┆ …   ┆ …     ┆ …     ┆ … ┆ …        ┆ …        ┆ …        ┆ …        │
    │ 1   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.875    ┆ 2.0      ┆ 2.0      ┆ 3.0      │
    │ 1   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 4.0      ┆ 2.0      ┆ 2.0      │
    │ 1   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 3.0      ┆ 3.0      ┆ 1.0      │
    │ 4   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.2      ┆ 1.0      ┆ 3.0      ┆ 2.0      │
    │ 4   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 1.5      ┆ 2.0      ┆ 3.0      │
    └─────┴─────┴───────┴───────┴───┴──────────┴──────────┴──────────┴──────────┘