Below is my data frame:
df = pd.DataFrame({
'P_ID': [348374, 348374, 348374, 1348374, 1348374, 1348374, 1348374],
'I_ISSUE_TYPE_CODE': ['CLAIM', 'RSA', 'RSA', 'EPQR', 'CLAIM', 'EPQR', 'EPQR'],
'A_b': [234, 5671, 8900, 3451, 660, 770, 660]
})
I want to check in column I_ISSUE_TYPE_CODE if it is RSA with respect to P_ID column then put 1 in new_column else put zero. if there is duplicate RSA in same P_ID then the second one will be zero. if there is no RSA in I_ISSUE_TYPE_CODE with respect to P_ID then put 1 at its 1st occurrence.
Below is my expected output
df = pd.DataFrame({
'P_ID': [348374, 348374, 348374, 1348374, 1348374, 1348374, 1348374],
'I_ISSUE_TYPE_CODE': ['CLAIM', 'RSA', 'RSA', 'EPQR', 'CLAIM', 'EPQR', 'EPQR'],
'A_b': [234, 5671, 8900, 3451, 660, 770, 660],
'New_column': [0, 1, 0, 1, 0, 0, 0]
})
Below code I have tried
df['New_column'] = df.groupby('P_ID')['I_ISSUE_TYPE_CODE'].apply(lambda x: (x == 'RSA').cumsum().eq(1).astype(int))
Still not getting the correct output
Use boolean arithmetic with help of duplicated
and groupby.transform('any')
, then convert the output to integer with astype
:
# is the value the first RSA per group?
m1 = (df['I_ISSUE_TYPE_CODE'].eq('RSA')
& ~df[['P_ID', 'I_ISSUE_TYPE_CODE']].duplicated()
)
# if there is no RSA in the group, identify the first index
m2 = ~df['P_ID'].duplicated() & ~m1.groupby(df['P_ID']).transform('any')
df['New_column'] = (m1|m2).astype(int)
Output:
P_ID I_ISSUE_TYPE_CODE A_b New_column
0 348374 CLAIM 234 0
1 348374 RSA 5671 1
2 348374 RSA 8900 0
3 1348374 EPQR 3451 1
4 1348374 CLAIM 660 0
5 1348374 EPQR 770 0
6 1348374 EPQR 660 0
Intermediates:
P_ID I_ISSUE_TYPE_CODE A_b New_column m1 m2
0 348374 CLAIM 234 0 False False
1 348374 RSA 5671 1 True False
2 348374 RSA 8900 0 False False
3 1348374 EPQR 3451 1 False True
4 1348374 CLAIM 660 0 False False
5 1348374 EPQR 770 0 False False
6 1348374 EPQR 660 0 False False
Alternative:
Same logic with a custom function for groupby.apply
:
def flag(g):
s = g['I_ISSUE_TYPE_CODE'].eq('RSA') & ~g['I_ISSUE_TYPE_CODE'].duplicated()
if s.any():
return s
return ~g['P_ID'].duplicated()
df['New_column'] = df.groupby('P_ID', group_keys=False).apply(flag).astype(int)