Search code examples
pythonpython-polars

Polars Aggregate Multiple Rows into One Row


I have a df created like this:

df = pl.from_repr("""
┌───────────────┬──────────────┬───────────────┐
│ schema_name   ┆ table_name   ┆ column_name   │
│ ---           ┆ ---          ┆ ---           │
│ str           ┆ str          ┆ str           │
╞═══════════════╪══════════════╪═══════════════╡
│ test_schema   ┆ test_table   ┆ test_column   │
│ test_schema   ┆ test_table   ┆ test_column   │
│ test_schema_2 ┆ test_table_2 ┆ test_column_2 │
└───────────────┴──────────────┴───────────────┘
""")

I would like to use polars to aggregate the column_name field by schema_name and table-name so that multiple values from column_name are combined into one row. The target dataset is this:

shape: (2, 3)
┌───────────────┬──────────────┬──────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_name              │
│ ---           ┆ ---          ┆ ---                      │
│ str           ┆ str          ┆ str                      │
╞═══════════════╪══════════════╪══════════════════════════╡
│ test_schema_2 ┆ test_table_2 ┆ test_column_2            │
│ test_schema   ┆ test_table   ┆ test_column, test_column │
└───────────────┴──────────────┴──────────────────────────┘

I can aggregate the values into a list with this:

df.group_by('schema_name','table_name').agg(pl.col('column_name').alias('column_list'))
shape: (2, 3)
┌───────────────┬──────────────┬────────────────────────────────┐
│ schema_name   ┆ table_name   ┆ column_list                    │
│ ---           ┆ ---          ┆ ---                            │
│ str           ┆ str          ┆ list[str]                      │
╞═══════════════╪══════════════╪════════════════════════════════╡
│ test_schema_2 ┆ test_table_2 ┆ ["test_column_2"]              │
│ test_schema   ┆ test_table   ┆ ["test_column", "test_column"] │
└───────────────┴──────────────┴────────────────────────────────┘

How do I then convert the list field (column_list) into a comma separated string? With pandas, I would do something like this:

df['column_list_string'] = [','.join(map(str, l)) for l in df['column_list']]

However, I can't figure out how to use .join() in combination with polars .agg().

Alternatively, how would I go straight from multiple rows to one row without using the list as an intermediate step?


Solution

  • Polars has its own .str.join() expression you can use.

    df.group_by("schema_name", "table_name").agg(pl.col("column_name").str.join(", "))
    
    shape: (2, 3)
    ┌───────────────┬──────────────┬──────────────────────────┐
    │ schema_name   | table_name   | column_name              │
    │ ---           | ---          | ---                      │
    │ str           | str          | str                      │
    ╞═══════════════╪══════════════╪══════════════════════════╡
    │ test_schema_2 | test_table_2 | test_column_2            │
    │ test_schema   | test_table   | test_column, test_column │
    └───────────────┴──────────────┴──────────────────────────┘