Tags: python, pandas, duplicates, data-cleaning, data-comparison

How to compare rows and show the reason for their uniqueness in a new column?


I have the following data:

  Name  Age Country  Occupation      Hobby
0    A   23      DE  Job holder    Fishing
1    A   23      DE  Job holder  Gardening
2    A   23      DE     Student    Fishing
3    A   23      DE  Job holder    Reading
4    B   15      SW  Job holder    Fishing
5    B   15      SW  Job holder    Playing
6    C   23      DD  Job holder     Coding
7    B   23      AA  Job holder    Fishing
8    D   34      GH  Job holder    Fishing
9    D   33      TR       Other    Fishing
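
For reproducibility, the same data can be rebuilt as a DataFrame (transcribed from the table above):

import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'B', 'D', 'D'],
    'Age': [23, 23, 23, 23, 15, 15, 23, 23, 34, 33],
    'Country': ['DE', 'DE', 'DE', 'DE', 'SW', 'SW', 'DD', 'AA', 'GH', 'TR'],
    'Occupation': ['Job holder', 'Job holder', 'Student', 'Job holder', 'Job holder',
                   'Job holder', 'Job holder', 'Job holder', 'Job holder', 'Other'],
    'Hobby': ['Fishing', 'Gardening', 'Fishing', 'Reading', 'Fishing',
              'Playing', 'Coding', 'Fishing', 'Fishing', 'Fishing'],
})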

When there are duplicate values in the "Name" column, I would like to compare those rows and find out why they are not identical. I want to add a new column that lists the header names of the columns responsible for the differences. If a name occurs only once, write 'Unique' in the Reason column.

The output should look like this:

  Name  Age Country  Occupation      Hobby                    Reason
0    A   23      DE  Job holder    Fishing         Occupation, Hobby
1    A   23      DE  Job holder  Gardening         Occupation, Hobby
2    A   23      DE     Student    Fishing         Occupation, Hobby
3    A   23      DE  Job holder    Reading         Occupation, Hobby
4    B   15      SW  Job holder    Fishing                     Hobby
5    B   15      SW  Job holder    Playing                     Hobby
6    C   23      DD  Job holder     Coding                    Unique
7    B   23      AA  Job holder    Fishing              Age, Country
8    D   34      GH  Job holder    Fishing  Age, Country, Occupation
9    D   33      TR       Other    Fishing  Age, Country, Occupation

I tried this:

dif = [i for i, (x, y) in enumerate(zip(df.loc[0].values, df.loc[9, :].values)) if x != y]
df.iloc[:, dif]

but it does not produce the output shown above. I am new to Python, so any help would be appreciated!


Solution

  • You can identify, for each name, the columns that are not unique, then join their names into a single string.

    There are multiple ways to do this.

    Using groupby.nunique and dot product:

    # number of distinct values per column within each Name group
    df2 = df.groupby('Name').nunique()
    # columns with more than one distinct value, joined into one string per Name
    df['Reason'] = df['Name'].map(df2.gt(1).dot(df2.columns+', ')
                                     .str[:-2].replace('', 'Unique'))
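
    To see what the dot-product step actually does, here is a sketch of the intermediates it produces for the sample data above (the commented frames are approximate renderings):

    # df2 holds the number of distinct values per column within each Name group
    df2 = df.groupby('Name').nunique()

    # df2.gt(1) flags the columns that vary within a group, e.g.:
    #         Age  Country  Occupation  Hobby
    # Name
    # A     False    False        True   True
    # B      True     True       False   True
    # C     False    False       False  False
    # D      True     True        True  False
    #
    # Dotting that boolean frame with df2.columns + ', ' multiplies each column
    # name by True/False (True * 'Age, ' == 'Age, ', False * 'Age, ' == '') and
    # concatenates across the row, giving e.g. 'Occupation, Hobby, ' for A;
    # .str[:-2] trims the trailing ', ' and C's empty string becomes 'Unique'.
    print(df2.gt(1).dot(df2.columns + ', ').str[:-2].replace('', 'Unique'))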
    

    With groupby.nunique and melt:

    # reshape to long format: one row per (Name, column, value) pair
    df2 = df.melt('Name')
    # sort=False keeps the joined column names in the original column order
    df['Reason'] = (df['Name']
                    .map(df2.groupby(['Name', 'variable'], as_index=False, sort=False)['value']
                            .nunique()
                            .query('value > 1')  # keep only the columns that vary within a Name
                            .groupby('Name')['variable'].agg(', '.join))
                    .fillna('Unique')  # names that occur only once have no match
                   )
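
    For context, df.melt('Name') turns the frame into long format, one row per (Name, column, value) pair, which the nested groupby then counts; a rough sketch of the first rows:

    print(df.melt('Name').head())
    #   Name variable value
    # 0    A      Age    23
    # 1    A      Age    23
    # 2    A      Age    23
    # 3    A      Age    23
    # 4    B      Age    15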
    

    With groupby.apply:

    # for each group, join the names of the columns with more than one distinct value
    df['Reason'] = df['Name'].map(df.groupby('Name')
                                    .apply(lambda g: ', '.join(g.columns[g.nunique().gt(1)]))
                                    .replace('', 'Unique')
                                 )
    

    Output:

      Name  Age Country  Occupation      Hobby                    Reason
    0    A   23      DE  Job holder    Fishing         Occupation, Hobby
    1    A   23      DE  Job holder  Gardening         Occupation, Hobby
    2    A   23      DE     Student    Fishing         Occupation, Hobby
    3    A   23      DE  Job holder    Reading         Occupation, Hobby
    4    B   15      SW  Job holder    Fishing       Age, Country, Hobby
    5    B   15      SW  Job holder    Playing       Age, Country, Hobby
    6    C   23      DD  Job holder     Coding                    Unique
    7    B   23      AA  Job holder    Fishing       Age, Country, Hobby
    8    D   34      GH  Job holder    Fishing  Age, Country, Occupation
    9    D   33      TR       Other    Fishing  Age, Country, Occupation
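
    If you need the same logic for other frames or key columns, the first approach can be wrapped in a small helper (a sketch; add_reason is just an illustrative name, not an existing API):

    def add_reason(df, key='Name', label='Unique'):
        """Return a copy of df with a 'Reason' column listing the columns
        whose values vary within each `key` group."""
        nun = df.groupby(key).nunique()
        reason = nun.gt(1).dot(nun.columns + ', ').str[:-2].replace('', label)
        return df.assign(Reason=df[key].map(reason))

    out = add_reason(df)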