
Identify duplicated rows with a different value in another column of a pandas DataFrame


Suppose I have a dataframe of names and countries:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
4   Maria       Lurdes      Espanha
5   Maria       Lurdes      Espanha
6   John        Page        USA
7   Felipe      Cardoso     Brasil
8   John        Page        USA
9   Felipe      Cardoso     Espanha
10  Steve       Xis         UK

I need a way to identify all people whose first name and last name appear more than once in the dataframe, where at least one of those records belongs to a different country, and to return all of the duplicated rows. The result should be this dataframe:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
7   Felipe      Cardoso     Brasil
9   Felipe      Cardoso     Espanha

What would be the best way to achieve it?


Solution

  • Use boolean indexing:

    # is the name present in several countries?
    m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)
    
    out = df.loc[m]
    

    Output:

       ID FirstName LastName  Country
    0   1     Paulo   Cortez   Brasil
    1   2     Paulo   Cortez   Brasil
    2   3     Paulo   Cortez  Espanha
    6   7    Felipe  Cardoso   Brasil
    8   9    Felipe  Cardoso  Espanha
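
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same mask. A name that shows up in more than one country necessarily shows up more than once, so the single `nunique` check covers both conditions. The variable names `n_countries` and `alt` are mine, and the `groupby(...).filter(...)` variant is an alternative I am adding for comparison, not part of the answer above. Note that `nunique` ignores missing values by default, so a row with a missing `Country` would not count as a separate country.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": range(1, 11),
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

# Number of distinct countries per (FirstName, LastName),
# broadcast back onto every row of the original frame.
n_countries = df.groupby(["FirstName", "LastName"])["Country"].transform("nunique")

# Keep only the rows whose name occurs in more than one country.
out = df.loc[n_countries.gt(1)]

# Alternative (my addition, not from the answer): filter whole groups.
# Usually slower on large frames because the lambda runs once per group.
alt = df.groupby(["FirstName", "LastName"]).filter(
    lambda g: g["Country"].nunique() > 1
)

print(out)
```

Both `out` and `alt` keep the original row order and index, so they can be compared directly with `out.equals(alt)`.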