python list duplicates multilabel-classification data-preprocessing

Create a new DataFrame column that contains a list

I'm working on a multi-label image classification task. I have a dataframe with two columns (id and labels). The same id can appear in several rows, each with a different label (this is the case in my data). I want to add a new column that collects all labels belonging to an id into a single list. I'm struggling to write the labels into the new column as a list. Can anyone help me here?

My df has the following structure:

| id       | labels         |
| -------- | -------------- |
| x.jpg    | label_1        |
| x.jpg    | label_2        |

Desired dataframe:

| id       | labels         | all_labels                                |
| -------- | -------------- | ----------------------------------------- |
| x.jpg    | label_1        | [label_1, label_2, and others if present] |
| x.jpg    | label_2        |                                           |

Solution

I think this does what you want, although the format is a bit different (one row per id instead of a new column on the original rows):

    newdf = df.groupby('id')['labels'].agg(list).reset_index(name='labels')

produces

          id              labels
    0  x.jpg  [label_1, label_2]
    1  y.jpg           [label_3]
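
If you do want the list attached to every original row, as in the table from the question, one possible approach is to map the grouped lists back onto the `id` column. This is only a minimal sketch: the sample data below is made up to match the output above, and `labels_per_id` is a helper name I introduced for illustration.

```python
import pandas as pd

# Made-up sample data mirroring the question (y.jpg added for illustration)
df = pd.DataFrame({
    "id": ["x.jpg", "x.jpg", "y.jpg"],
    "labels": ["label_1", "label_2", "label_3"],
})

# One list of labels per id; the result is a Series indexed by id
labels_per_id = df.groupby("id")["labels"].agg(list)

# Look up each row's id in that Series, so every row gets the full list
df["all_labels"] = df["id"].map(labels_per_id)

print(df)
```

With this, both `x.jpg` rows carry the full list for `x.jpg`, rather than only the first one as in the question's table. If you only need one row per image for training, the grouped `newdf` above is probably the more convenient format.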