Search code examples
pythonpython-polars

Polars merging list of lists into a comma separated string while removing duplicates


Theres a similar question already on this, but the answer does not solve the question.

df = pl.DataFrame({"id": [1, 2, 1], "name": ['jenobi', 'blah', 'jenobi'], 
                "company": [[['some company 1', 'some company2'], ['some company2']], 
                [['company 1'], ['company 2', 'company 3']],
                [['some company 1'], ['some company2', 'some company 1', 'some company 2']]]
               })

Dataframe follows the schema as above. Want to merge the lists of lists during a groupby and aggregate on the id and name.

Want the result to show a string concatenated value, for example jenobi should show the following company: "some company 1, some company2, some company 2".

Have tried doing a groupby agg on the company and flattening the result however this produces a panic error.

Based on jqurious comments, the issue with doing a flatten then a join is that the list is flattened. However, there are double quotes around the flattened sub-lists in the output.

enter image description here

This is produced from the following..

df.groupby("name").agg(pl.col("company").flatten().list.join(", "))
df.with_columns(pl.col("company").list.unique())

Ideally, the final result will show.. enter image description here

Panic Error

data = (
    pl.read_parquet(r"input.parquet")
    .select("id", "name", "company")
    .groupby("id", "name")
    .agg(
        pl.col("company").flatten().list.unique()
    )
)

enter image description here

Any suggestions?


Solution

  • From version 0.19.9 we can expand/enlarge the repr for list columns.

    pl.Config(fmt_table_cell_list_len=-1, fmt_str_lengths=120)
    

    You can .flatten() each level of nesting inside the .agg() and then call .unique()

    (df.group_by("name")
       .agg(pl.col("company").flatten().flatten().unique())
    )
    
    shape: (2, 2)
    ┌────────┬───────────────────────────────────────────────────────┐
    │ name   ┆ company                                               │
    │ ---    ┆ ---                                                   │
    │ str    ┆ list[str]                                             │
    ╞════════╪═══════════════════════════════════════════════════════╡
    │ blah   ┆ ["company 1", "company 2", "company 3"]               │
    │ jenobi ┆ ["some company2", "some company 2", "some company 1"] │
    └────────┴───────────────────────────────────────────────────────┘
    
    • Note: df.groupby was renamed to df.group_by in 0.19.0