I am trying to extract information from duplicates.
data = np.array([[100,1,0, 'GB'],[100,0,1, 'IT'],[101,1,0, 'CN'],[101,0,1, 'CN'],
[102,1,0, 'JP'],[102,0,1, 'CN'],[103,0,1, 'DE'],
[103,0,1, 'DE'],[103,1,0, 'VN'],[103,1,0, 'VN']])
df = pd.DataFrame(data, columns = ['wed_cert_id','spouse_1',
'spouse_2', 'nationality'])
I would like to categorise each wedding as either cross-national or not. In my actual data set there can be more than 2 spouses to a marriage.
My aim is to obtain a data frame like this:
or like this:
I have tried to find a way to filter the data using .duplicated() and trying to deny .duplicated() with a not operator, but have not succeed in working it out:
df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]
Dropping the duplicates drops too many observations. My data set allows for >2 spouses per wedding, creating the potential for duplication:
df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)
How do I do it?
Many thanks from now
I believe you need:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
.transform('nunique').gt(1).view('i1'))
print(df)
Or:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
.transform('nunique').gt(1).view('i1')
.mul(df[['spouse_1','spouse_2']].prod(1)))
print(df)
wed_cert_id spouse_1 spouse_2 nationality cross_national
0 100 1 0 GB 1
1 100 0 1 IT 1
2 101 1 0 CN 0
3 101 0 1 CN 0
4 102 1 0 JP 1
5 102 0 1 CN 1
6 103 0 1 DE 1
7 103 0 1 DE 1
8 103 1 0 VN 1
9 103 1 0 VN 1