I am trying to write an aggregation routine where values in columns are concatenated based on a group_by statement.
I want to call a custom function to do the aggregation, and I also want to avoid lambda functions (my understanding is that they only run serially in Python, so performance would be slower). Here is my code:
def agg_ll_field(col_name) -> pl.Expr:
    return ';'.join(pl.col(col_name).drop_nulls().unique().sort())

dfa = df.lazy()\
    .group_by('SharedSourceSystem', 'FOPortfolioName').agg(
        agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
    ).collect()
I keep on getting an error:
agg_ll_field: Unexpected: can only join an iterable <class 'TypeError'>
Would anyone be able to help resolve this?
I tried using the map_groups function instead - that seems to work but I'm trying to avoid map_groups, since performance is supposed to be worse.
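The error occurs because ';'.join(...) is Python's built-in str.join, which needs an iterable of strings, whereas pl.col(...) returns a Polars expression that is not iterable. Use the expression method .str.join instead. Here is the full example using str.join: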
import polars as pl

# Create a sample DataFrame
data = {
    'SharedSourceSystem': ['A', 'A', 'B', 'B', 'B'],
    'FOPortfolioName': ['X', 'X', 'Y', 'Y', 'Y'],
    'BookingUnits': [1, 2, 2, 2, 3]
}
df = pl.DataFrame(data)

# Define the custom aggregation function
def agg_ll_field(col_name) -> pl.Expr:
    return pl.col(col_name).drop_nulls().unique().sort().str.join(';')

# Apply the lazy groupby and aggregation
dfa = (
    df.lazy()
    .group_by('SharedSourceSystem', 'FOPortfolioName')
    .agg(
        agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
    )
    .collect()
)
# Output
┌────────────────────┬─────────────────┬────────────┐
│ SharedSourceSystem ┆ FOPortfolioName ┆ BOOKG_UNIT │
│ ---                ┆ ---             ┆ ---        │
│ str                ┆ str             ┆ str        │
╞════════════════════╪═════════════════╪════════════╡
│ A                  ┆ X               ┆ 1;2        │
│ B                  ┆ Y               ┆ 2;3        │
└────────────────────┴─────────────────┴────────────┘