Theres a similar question already on this, but the answer does not solve the question.
df = pl.DataFrame({"id": [1, 2, 1], "name": ['jenobi', 'blah', 'jenobi'],
"company": [[['some company 1', 'some company2'], ['some company2']],
[['company 1'], ['company 2', 'company 3']],
[['some company 1'], ['some company2', 'some company 1', 'some company 2']]]
})
Dataframe follows the schema as above. Want to merge the lists of lists during a groupby and aggregate on the id and name.
Want the result to show a string concatenated value, for example jenobi should show the following company: "some company 1, some company2, some company 2".
Have tried doing a groupby agg on the company and flattening the result however this produces a panic error.
Based on jqurious comments, the issue with doing a flatten then a join is that the list is flattened. However, there are double quotes around the flattened sub-lists in the output.
This is produced from the following..
df.groupby("name").agg(pl.col("company").flatten().list.join(", "))
df.with_columns(pl.col("company").list.unique())
Ideally, the final result will show..
data = (
pl.read_parquet(r"input.parquet")
.select("id", "name", "company")
.groupby("id", "name")
.agg(
pl.col("company").flatten().list.unique()
)
)
Any suggestions?
From version 0.19.9
we can expand/enlarge the repr for list columns.
pl.Config(fmt_table_cell_list_len=-1, fmt_str_lengths=120)
You can .flatten()
each level of nesting inside the .agg()
and then call .unique()
(df.group_by("name")
.agg(pl.col("company").flatten().flatten().unique())
)
shape: (2, 2)
┌────────┬───────────────────────────────────────────────────────┐
│ name ┆ company │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════╪═══════════════════════════════════════════════════════╡
│ blah ┆ ["company 1", "company 2", "company 3"] │
│ jenobi ┆ ["some company2", "some company 2", "some company 1"] │
└────────┴───────────────────────────────────────────────────────┘
df.groupby
was renamed to df.group_by
in 0.19.0