Search code examples
pythonpandasfor-loopgroup-byapply

I am writing python code and stuck in one condition


Below is my data frame:

df = pd.DataFrame({
    'P_ID': [348374, 348374, 348374, 1348374, 1348374, 1348374, 1348374],
    'I_ISSUE_TYPE_CODE': ['CLAIM', 'RSA', 'RSA', 'EPQR', 'CLAIM', 'EPQR', 'EPQR'],
    'A_b': [234, 5671, 8900, 3451, 660, 770, 660]
})

I want to check in column I_ISSUE_TYPE_CODE if it is RSA with respect to P_ID column then put 1 in new_column else put zero. if there is duplicate RSA in same P_ID then the second one will be zero. if there is no RSA in I_ISSUE_TYPE_CODE with respect to P_ID then put 1 at its 1st occurrence.

Below is my expected output

df = pd.DataFrame({
    'P_ID': [348374, 348374, 348374, 1348374, 1348374, 1348374, 1348374],
    'I_ISSUE_TYPE_CODE': ['CLAIM', 'RSA', 'RSA', 'EPQR', 'CLAIM', 'EPQR', 'EPQR'],
    'A_b': [234, 5671, 8900, 3451, 660, 770, 660],
    'New_column': [0, 1, 0, 1, 0, 0, 0]
})

Below code I have tried

df['New_column'] = df.groupby('P_ID')['I_ISSUE_TYPE_CODE'].apply(lambda x: (x == 'RSA').cumsum().eq(1).astype(int))

Still not getting the correct output


Solution

  • Use boolean arithmetic with help of duplicated and groupby.transform('any'), then convert the output to integer with astype:

    # is the value the first RSA per group?
    m1 = (df['I_ISSUE_TYPE_CODE'].eq('RSA')
         & ~df[['P_ID', 'I_ISSUE_TYPE_CODE']].duplicated()
         )
    
    # if there is no RSA in the group, identify the first index
    m2 = ~df['P_ID'].duplicated() & ~m1.groupby(df['P_ID']).transform('any')
    
    df['New_column'] = (m1|m2).astype(int)
    

    Output:

          P_ID I_ISSUE_TYPE_CODE   A_b  New_column
    0   348374             CLAIM   234           0
    1   348374               RSA  5671           1
    2   348374               RSA  8900           0
    3  1348374              EPQR  3451           1
    4  1348374             CLAIM   660           0
    5  1348374              EPQR   770           0
    6  1348374              EPQR   660           0
    

    Intermediates:

          P_ID I_ISSUE_TYPE_CODE   A_b  New_column     m1     m2
    0   348374             CLAIM   234           0  False  False
    1   348374               RSA  5671           1   True  False
    2   348374               RSA  8900           0  False  False
    3  1348374              EPQR  3451           1  False   True
    4  1348374             CLAIM   660           0  False  False
    5  1348374              EPQR   770           0  False  False
    6  1348374              EPQR   660           0  False  False
    

    Alternative:

    Same logic with a custom function for groupby.apply:

    def flag(g):
        s = g['I_ISSUE_TYPE_CODE'].eq('RSA') & ~g['I_ISSUE_TYPE_CODE'].duplicated()
        if s.any():
            return s
        return ~g['P_ID'].duplicated()
    
    df['New_column'] = df.groupby('P_ID', group_keys=False).apply(flag).astype(int)