pandas, duplicates, boolean-operations

How do I extract information from nested duplicates in pandas?


I am trying to extract information from duplicates.

    import numpy as np
    import pandas as pd

    # Note: a mixed int/str np.array stores every value as a string
    data = np.array([[100,1,0, 'GB'],[100,0,1, 'IT'],[101,1,0, 'CN'],[101,0,1, 'CN'],
             [102,1,0, 'JP'],[102,0,1, 'CN'],[103,0,1, 'DE'],
             [103,0,1, 'DE'],[103,1,0, 'VN'],[103,1,0, 'VN']])
    df = pd.DataFrame(data, columns = ['wed_cert_id','spouse_1',
                               'spouse_2', 'nationality'])

I would like to categorise each wedding as either cross-national or not. In my actual data set there can be more than 2 spouses to a marriage.

My aim is to obtain a data frame like this:

[image: desired output]

or like this:

[image: alternative desired output]

I have tried to filter the data using .duplicated(), negating a second .duplicated() mask with a not operator, but have not succeeded in working it out:

    df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
    df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]

Dropping the duplicates drops too many observations. My data set allows for >2 spouses per wedding, creating the potential for duplication:

    df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)

How do I do it?

Many thanks in advance.


Solution

  • I believe you need:

    # Count distinct nationalities per wedding, then flag weddings with more than one
    df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                                .transform('nunique').gt(1).astype('int8'))
    print(df)
    

    Or:

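    # Same flag, multiplied by the row-wise product of the spouse columns
    # (this assumes spouse_1 and spouse_2 hold numeric values, not strings)
    df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                           .transform('nunique').gt(1).astype('int8')
                           .mul(df[['spouse_1','spouse_2']].prod(axis=1)))
    print(df)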
    

      wed_cert_id spouse_1 spouse_2 nationality  cross_national
    0         100        1        0          GB               1
    1         100        0        1          IT               1
    2         101        1        0          CN               0
    3         101        0        1          CN               0
    4         102        1        0          JP               1
    5         102        0        1          CN               1
    6         103        0        1          DE               1
    7         103        0        1          DE               1
    8         103        1        0          VN               1
    9         103        1        0          VN               1
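
If one row per wedding is preferred, the same per-group count can be aggregated instead of broadcast back to each row. The following is only a minimal sketch: the target tables from the question are not reproduced here, so the one-row-per-wedding shape is an assumption, the sample frame is rebuilt from a dict (rather than `np.array`) so the numeric columns keep integer dtypes, and `n_nat` and `per_wedding` are illustrative names.

```python
import pandas as pd

# Sample data from the question, built from a dict so numeric columns stay integers
df = pd.DataFrame({
    'wed_cert_id': [100, 100, 101, 101, 102, 102, 103, 103, 103, 103],
    'spouse_1':    [1, 0, 1, 0, 1, 0, 0, 0, 1, 1],
    'spouse_2':    [0, 1, 0, 1, 0, 1, 1, 1, 0, 0],
    'nationality': ['GB', 'IT', 'CN', 'CN', 'JP', 'CN', 'DE', 'DE', 'VN', 'VN'],
})

# Number of distinct nationalities per wedding, broadcast back to every row
n_nat = df.groupby('wed_cert_id')['nationality'].transform('nunique')

# Row-level flag: 1 if the wedding involves more than one nationality, else 0
df['cross_national'] = n_nat.gt(1).astype(int)

# One row per wedding: aggregate with nunique() instead of transform()
per_wedding = (df.groupby('wed_cert_id')['nationality']
                 .nunique()
                 .gt(1)
                 .astype(int)
                 .rename('cross_national')
                 .reset_index())
print(per_wedding)
```

Because the count is taken within each `wed_cert_id` group, a nationality that also appears in another wedding (such as CN in weddings 101 and 102) does not affect the result, and any number of spouses per wedding is handled the same way.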