python, pandas, string, aggregate

Aggregating df columns but not duplicates


Is there a neat way to aggregate columns into a new column without duplicating information?

For example, if I have a df:

     Description  Information
  0       text1     text1
  1       text2     text3
  2       text4     text5

And I want to create a new column called 'Combined', which aggregates 'Description' and 'Information' to get:

     Description  Information  Combined
  0       text1     text1        text1
  1       text2     text3      text2 text3
  2       text4     text5      text4 text5

So far I have been using np.where with a boolean mask to check for duplicates, and then aggregating the remaining rows with:

    df['Combined'] = df[['Description', 'Information']].agg(' '.join, axis=1)

Although this works, it is not practical on a larger scale. I would be grateful if anyone knows of a simpler way!


Solution

  • You can call unique() on each row before joining, so a value that appears in both columns is kept only once:

    df['Combined'] = (df[['Description', 'Information']]
                      .agg(lambda x: ' '.join(x.unique()), axis=1)
                     )
    

    Output:

      Description Information     Combined
    0       text1       text1        text1
    1       text2       text3  text2 text3
    2       text4       text5  text4 text5
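
  • Since the question mentions scale, here is a hedged sketch of a column-wise alternative that avoids calling a Python function per row. It is not part of the answer above. It assumes exactly two string columns with no missing values (NaN would propagate through the concatenation), and the names same and joined are only illustrative:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "Description": ["text1", "text2", "text4"],
    "Information": ["text1", "text3", "text5"],
})

# Boolean mask: True where both columns hold the same text
same = df["Description"] == df["Information"]

# Concatenate the two columns for every row
joined = df["Description"] + " " + df["Information"]

# Keep the joined text where the values differ; otherwise use the single value
df["Combined"] = joined.where(~same, df["Description"])

print(df)
```

    For more than two columns, the row-wise unique() approach above generalises more directly, because this sketch only compares one pair of columns.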