I want to merge rows in my df so I have one unique row per ID/Name with other values either summed (revenue) or concatenated (subject and product). However, where I am concatenating, I do not want duplicates to appear.
My df is similar to this:
ID Name Revenue Subject Product
123 John 125 Maths A
123 John 75 English B
246 Mary 32 History B
312 Peter 67 Maths A
312 Peter 39 Science A
I am using the following code to aggregate rows in my data frame
def f(x): return ' '.join(list(x))
df.groupby(['ID', 'Name']).agg(
{'Revenue': 'sum', 'Subject': f, 'Product': f}
)
This results in output like this:
ID Name Revenue Subject Product
123 John 200 Maths English A B
246 Mary 32 History B
312 Peter 106 Maths Science A A
How can I amend my code so that duplicates are removed in my concatenation? So in the example above the last row reads A
in Product and not A A
You are very close. First apply set on the items before listing and joining them. This will return only unique items
def f(x): return ' '.join(list(set(x)))
df.groupby(['ID', 'Name']).agg(
{'Revenue': 'sum', 'Subject': f, 'Product': f}
)