I have a df created like this:
df = pl.from_repr("""
┌───────────────┬──────────────┬───────────────┐
│ schema_name ┆ table_name ┆ column_name │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════╪══════════════╪═══════════════╡
│ test_schema ┆ test_table ┆ test_column │
│ test_schema ┆ test_table ┆ test_column │
│ test_schema_2 ┆ test_table_2 ┆ test_column_2 │
└───────────────┴──────────────┴───────────────┘
""")
I would like to use polars to aggregate the column_name
field by schema_name
and table-name
so that multiple values from column_name
are combined into one row. The target dataset is this:
shape: (2, 3)
┌───────────────┬──────────────┬──────────────────────────┐
│ schema_name ┆ table_name ┆ column_name │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════╪══════════════╪══════════════════════════╡
│ test_schema_2 ┆ test_table_2 ┆ test_column_2 │
│ test_schema ┆ test_table ┆ test_column, test_column │
└───────────────┴──────────────┴──────────────────────────┘
I can aggregate the values into a list with this:
df.group_by('schema_name','table_name').agg(pl.col('column_name').alias('column_list'))
shape: (2, 3)
┌───────────────┬──────────────┬────────────────────────────────┐
│ schema_name ┆ table_name ┆ column_list │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] │
╞═══════════════╪══════════════╪════════════════════════════════╡
│ test_schema_2 ┆ test_table_2 ┆ ["test_column_2"] │
│ test_schema ┆ test_table ┆ ["test_column", "test_column"] │
└───────────────┴──────────────┴────────────────────────────────┘
How do I then convert the list field (column_list
) into a comma separated string? With pandas, I would do something like this:
df['column_list_string'] = [','.join(map(str, l)) for l in df['column_list']]
However, I can't figure out how to use .join()
in combination with polars .agg()
.
Alternatively, how would I go straight from multiple rows to one row without using the list as an intermediate step?
Polars has its own .str.join()
expression you can use.
df.group_by("schema_name", "table_name").agg(pl.col("column_name").str.join(", "))
shape: (2, 3)
┌───────────────┬──────────────┬──────────────────────────┐
│ schema_name | table_name | column_name │
│ --- | --- | --- │
│ str | str | str │
╞═══════════════╪══════════════╪══════════════════════════╡
│ test_schema_2 | test_table_2 | test_column_2 │
│ test_schema | test_table | test_column, test_column │
└───────────────┴──────────────┴──────────────────────────┘