Python Polars: join column values into a concatenated string

I am trying to write an aggregation routine that concatenates the values of a column within each group of a group_by.

I want to do the aggregation in a custom function and avoid lambda, since my understanding is that lambda functions run serially in Python and would therefore be slower. Here is my code:

def agg_ll_field(col_name) -> pl.Expr:
    return ';'.join(pl.col(col_name).drop_nulls().unique().sort())
   
dfa = df.lazy()\
    .group_by('SharedSourceSystem', 'FOPortfolioName').agg(
        agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
    ).collect()

I keep on getting an error:

agg_ll_field: Unexpected:  can only join an iterable   <class 'TypeError'>

Would anyone be able to help resolve this?

I tried using the map_groups function instead, which seems to work, but I would like to avoid it because its performance is supposed to be worse.

Solution
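
The error itself comes from Python rather than Polars: ';'.join(...) is the built-in string method, which needs an iterable of strings, while pl.col(...) only builds an expression that describes work to be done later and holds no values to iterate over. A minimal sketch of this, using a hypothetical stand-in class (LazyExpr is made up for illustration and is not a Polars type):

```python
class LazyExpr:
    """Hypothetical stand-in for an expression: it describes work, it holds no data."""

    def __init__(self, description: str) -> None:
        self.description = description


# Assumed description string, for illustration only.
expr = LazyExpr("col('BookingUnits').drop_nulls().unique().sort()")

try:
    # The built-in str.join tries to iterate over its argument and fails,
    # because the object is not iterable.
    ";".join(expr)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # can only join an iterable
```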

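The fix is to use the expression method str.join rather than Python's built-in ';'.join(...). Because it is part of the expression, Polars evaluates it per group when the query runs. Here is the full example: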

import polars as pl

# Create a sample DataFrame
data = {
    'SharedSourceSystem': ['A', 'A', 'B', 'B', 'B'],
    'FOPortfolioName': ['X', 'X', 'Y', 'Y', 'Y'],
    'BookingUnits': [1, 2, 2, 2, 3]
}

df = pl.DataFrame(data)

# Define the custom aggregation function
def agg_ll_field(col_name) -> pl.Expr:
    return pl.col(col_name).drop_nulls().unique().sort().str.join(';')

# Apply the lazy group_by and aggregation
dfa = (
    df.lazy()
      .group_by('SharedSourceSystem', 'FOPortfolioName')
      .agg(
          agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
      )
      .collect()
)

print(dfa)

# Output (row order may vary: group_by does not preserve input order by default)

┌────────────────────┬─────────────────┬────────────┐
│ SharedSourceSystem ┆ FOPortfolioName ┆ BOOKG_UNIT │
│ ---                ┆ ---             ┆ ---        │
│ str                ┆ str             ┆ str        │
╞════════════════════╪═════════════════╪════════════╡
│ A                  ┆ X               ┆ 1;2        │
│ B                  ┆ Y               ┆ 2;3        │
└────────────────────┴─────────────────┴────────────┘