Sample:
import polars as pl

df = pl.DataFrame(
    {
        "NE": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3", "P3", "P3"],
        "DT": ["D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2"],
        "TM": ["H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2"],
        "UT": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    }
)
df
shape: (12, 4)
┌─────┬─────┬─────┬─────┐
│ NE ┆ DT ┆ TM ┆ UT │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ P1 ┆ D1 ┆ H1 ┆ 1 │
│ P1 ┆ D1 ┆ H2 ┆ 2 │
│ P1 ┆ D2 ┆ H1 ┆ 3 │
│ P1 ┆ D2 ┆ H2 ┆ 4 │
│ … ┆ … ┆ … ┆ … │
│ P3 ┆ D1 ┆ H1 ┆ 9 │
│ P3 ┆ D1 ┆ H2 ┆ 10 │
│ P3 ┆ D2 ┆ H1 ┆ 11 │
│ P3 ┆ D2 ┆ H2 ┆ 12 │
└─────┴─────┴─────┴─────┘
I need to do two levels of aggregation on a DataFrame with many columns and a few million rows:
(
    df.group_by("NE", "DT")
    .agg(pl.max("UT"))
    .group_by("NE")
    .agg(pl.max("UT").alias("UT_max"), pl.mean("UT").alias("UT_avg"))
)
shape: (3, 3)
┌─────┬────────┬────────┐
│ NE ┆ UT_max ┆ UT_avg │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪════════╪════════╡
│ P2 ┆ 8 ┆ 7.0 │
│ P3 ┆ 12 ┆ 11.0 │
│ P1 ┆ 4 ┆ 3.0 │
└─────┴────────┴────────┘
Is there an efficient way to do this in a single group_by?
Thanks in advance.
In general, there isn't a significantly more efficient way in terms of computational work: the two aggregations run at different granularities (UT_avg is the mean of the per-(NE, DT) maxima, which a single group_by over NE can't produce directly), so two passes are needed. You mentioned having many columns, so let's say you had something like:
df = pl.DataFrame(
    [
        pl.Series("NE", ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3", "P3", "P3"], dtype=pl.String),
        pl.Series("DT", ["D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2", "D1", "D1", "D2", "D2"], dtype=pl.String),
        pl.Series("TM", ["H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2", "H1", "H2"], dtype=pl.String),
        pl.Series("UT", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=pl.Int64),
        pl.Series("UT2", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=pl.Int64),
    ]
)
You can use selectors together with name.suffix to save typing out all the numeric columns. Since the second group_by uses one of the same key columns as the first group_by, you might see better overall performance by setting maintain_order=True in the first one.
import polars.selectors as cs

(
    df.group_by("NE", "DT", maintain_order=True)
    .agg(cs.numeric().max())
    .group_by("NE")
    .agg(
        cs.numeric().max().name.suffix("_max"),
        cs.numeric().mean().name.suffix("_avg"),
    )
)
shape: (3, 5)
┌─────┬────────┬─────────┬────────┬─────────┐
│ NE ┆ UT_max ┆ UT2_max ┆ UT_avg ┆ UT2_avg │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪════════╪═════════╪════════╪═════════╡
│ P3 ┆ 12 ┆ 12 ┆ 11.0 ┆ 11.0 │
│ P2 ┆ 8 ┆ 8 ┆ 7.0 ┆ 7.0 │
│ P1 ┆ 4 ┆ 4 ┆ 3.0 ┆ 3.0 │
└─────┴────────┴─────────┴────────┴─────────┘