Tags: python, python-polars

Two level group_by to calculate the average of max aggregation in Polars


Sample:

import polars as pl
df = pl.DataFrame(
    {
        'NE' : ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3", "P3", "P3" ],
        'DT' : ["D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2" ],
        'TM' : ["H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2" ],
        'UT' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ]
    }
)
df
shape: (12, 4)
┌─────┬─────┬─────┬─────┐
│ NE  ┆ DT  ┆ TM  ┆ UT  │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ P1  ┆ D1  ┆ H1  ┆ 1   │
│ P1  ┆ D1  ┆ H2  ┆ 2   │
│ P1  ┆ D2  ┆ H1  ┆ 3   │
│ P1  ┆ D2  ┆ H2  ┆ 4   │
│ …   ┆ …   ┆ …   ┆ …   │
│ P3  ┆ D1  ┆ H1  ┆ 9   │
│ P3  ┆ D1  ┆ H2  ┆ 10  │
│ P3  ┆ D2  ┆ H1  ┆ 11  │
│ P3  ┆ D2  ┆ H2  ┆ 12  │
└─────┴─────┴─────┴─────┘

I need to do two levels of aggregation on a DataFrame with many columns and a few million rows:

df.group_by('NE', 'DT').agg(pl.max('UT')).group_by('NE').agg(
    pl.max('UT').alias('UT_max'), pl.mean('UT').alias('UT_avg')
)

shape: (3, 3)
┌─────┬────────┬────────┐
│ NE  ┆ UT_max ┆ UT_avg │
│ --- ┆ ---    ┆ ---    │
│ str ┆ i64    ┆ f64    │
╞═════╪════════╪════════╡
│ P2  ┆ 8      ┆ 7.0    │
│ P3  ┆ 12     ┆ 11.0   │
│ P1  ┆ 4      ┆ 3.0    │
└─────┴────────┴────────┘

Is there an efficient way to do this in a single group_by?

Thanks in advance.


Solution

  • In general, there isn't a significantly more efficient single-group_by alternative: the per-('NE', 'DT') maxima have to be computed before the per-'NE' mean of those maxima can be taken. You mentioned having many columns, so let's say you had something like:

    df = pl.DataFrame(
        [
            pl.Series("NE", ['P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P2', 'P3', 'P3', 'P3', 'P3'], dtype=pl.Utf8),
            pl.Series("DT", ['D1', 'D1', 'D2', 'D2', 'D1', 'D1', 'D2', 'D2', 'D1', 'D1', 'D2', 'D2'], dtype=pl.Utf8),
            pl.Series("TM", ['H1', 'H2', 'H1', 'H2', 'H1', 'H2', 'H1', 'H2', 'H1', 'H2', 'H1', 'H2'], dtype=pl.Utf8),
            pl.Series("UT", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=pl.Int64),
            pl.Series("UT2", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=pl.Int64),
        ]
    )
    

    You can use the selectors module together with a column-name suffix to save yourself from typing out all the numeric columns.

    Since the second group_by reuses one of the keys from the first, you might see better overall performance by setting maintain_order=True in the first one.

    import polars.selectors as cs
    (df
     .group_by('NE', 'DT', maintain_order=True)
     .agg(cs.numeric().max())
     .group_by('NE')
     .agg(cs.numeric().max().name.suffix("_max"),
          cs.numeric().mean().name.suffix("_avg"))
     )
    shape: (3, 5)
    ┌─────┬────────┬─────────┬────────┬─────────┐
    │ NE  ┆ UT_max ┆ UT2_max ┆ UT_avg ┆ UT2_avg │
    │ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---     │
    │ str ┆ i64    ┆ i64     ┆ f64    ┆ f64     │
    ╞═════╪════════╪═════════╪════════╪═════════╡
    │ P3  ┆ 12     ┆ 12      ┆ 11.0   ┆ 11.0    │
    │ P2  ┆ 8      ┆ 8       ┆ 7.0    ┆ 7.0     │
    │ P1  ┆ 4      ┆ 4       ┆ 3.0    ┆ 3.0     │
    └─────┴────────┴─────────┴────────┴─────────┘
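
  • With a few million rows, it may also be worth expressing the same two-step pipeline through the lazy API, so the query optimizer can plan both group_bys (and any upstream column pruning) as a single query before executing. A minimal sketch of the same query, using a small made-up frame with the column names from the question:

    ```python
    import polars as pl
    import polars.selectors as cs

    df = pl.DataFrame(
        {
            "NE": ["P1", "P1", "P2", "P2"],
            "DT": ["D1", "D2", "D1", "D2"],
            "UT": [1, 2, 3, 4],
        }
    )

    # Same two-level aggregation, but built lazily and only executed
    # when .collect() is called.
    result = (
        df.lazy()
        .group_by("NE", "DT")
        .agg(cs.numeric().max())
        .group_by("NE")
        .agg(
            cs.numeric().max().name.suffix("_max"),
            cs.numeric().mean().name.suffix("_avg"),
        )
        .sort("NE")
        .collect()
    )
    print(result)
    ```

    If the data lives in a file, starting from pl.scan_csv or pl.scan_parquet instead of df.lazy() additionally lets Polars read only the columns the query touches.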