Polars merging list of lists into a comma separated string while removing duplicates

Theres a similar question already on this, but the answer does not solve the question.

df = pl.DataFrame({"id": [1, 2, 1], "name": ['jenobi', 'blah', 'jenobi'], 
                "company": [[['some company 1', 'some company2'], ['some company2']], 
                [['company 1'], ['company 2', 'company 3']],
                [['some company 1'], ['some company2', 'some company 1', 'some company 2']]]
               })

Dataframe follows the schema as above. Want to merge the lists of lists during a groupby and aggregate on the id and name.

Want the result to show a string concatenated value, for example jenobi should show the following company: "some company 1, some company2, some company 2".

Have tried doing a groupby agg on the company and flattening the result however this produces a panic error.

Based on jqurious comments, the issue with doing a flatten then a join is that the list is flattened. However, there are double quotes around the flattened sub-lists in the output.

This is produced from the following..

df.groupby("name").agg(pl.col("company").flatten().list.join(", "))
df.with_columns(pl.col("company").list.unique())

Ideally, the final result will show..

Panic Error

data = (
    pl.read_parquet(r"input.parquet")
    .select("id", "name", "company")
    .groupby("id", "name")
    .agg(
        pl.col("company").flatten().list.unique()
    )
)

Any suggestions?

Solution

From version 0.19.9 we can expand/enlarge the repr for list columns.

pl.Config(fmt_table_cell_list_len=-1, fmt_str_lengths=120)

You can .flatten() each level of nesting inside the .agg() and then call .unique()

(df.group_by("name")
   .agg(pl.col("company").flatten().flatten().unique())
)

shape: (2, 2)
┌────────┬───────────────────────────────────────────────────────┐
│ name   ┆ company                                               │
│ ---    ┆ ---                                                   │
│ str    ┆ list[str]                                             │
╞════════╪═══════════════════════════════════════════════════════╡
│ blah   ┆ ["company 1", "company 2", "company 3"]               │
│ jenobi ┆ ["some company2", "some company 2", "some company 1"] │
└────────┴───────────────────────────────────────────────────────┘

Note: df.groupby was renamed to df.group_by in 0.19.0