Search code examples
pythonpandasconcatenationaggregate

Aggregating rows in a data frame and eliminating duplicates


I want to merge rows in my df so I have one unique row per ID/Name with other values either summed (revenue) or concatenated (subject and product). However, where I am concatenating, I do not want duplicates to appear.

My df is similar to this:

ID   Name   Revenue   Subject   Product
123  John   125       Maths     A
123  John   75        English   B
246  Mary   32        History   B
312  Peter  67        Maths     A
312  Peter  39        Science   A

I am using the following code to aggregate rows in my data frame

def f(x): return ' '.join(list(x))

df.groupby(['ID', 'Name']).agg( 
  {'Revenue': 'sum', 'Subject': f, 'Product': f} 
)

This results in output like this:

ID   Name   Revenue   Subject        Product
123 John    200       Maths English  A B
246 Mary    32        History        B
312 Peter   106       Maths Science  A A

How can I amend my code so that duplicates are removed in my concatenation? So in the example above the last row reads A in Product and not A A


Solution

  • You are very close. First apply set on the items before listing and joining them. This will return only unique items

    def f(x): return ' '.join(list(set(x)))
    
    df.groupby(['ID', 'Name']).agg( 
      {'Revenue': 'sum', 'Subject': f, 'Product': f} 
    )