Search code examples
pythonpython-3.xpandasdataframegroup-by

New column based on the largest 3 elements in a group indexed by Group ID


I have a dataframe that looks like

Group_ID    probability 
34883       0.002676    
34883       0.17826266  
34883       0.01399753  
34883       0.04569782  
34883       0.02799506  
34883       0.02634829  
34883       0.02923014  
34883       0.13544669  
34883       0.07595718  
34883       0.19246604  
34883       0.20028818  
34883       0           
34883       0           
34883       0.07163442  
34897       0.03329843  
34897       0.07643979  
34897       0.09570681  
34897       0.00376963  
34897       0.01780105  
34897       0.0008377   
34897       0.08125654  
34897       0.10764398  
34897       0.25780105  
34897       0.10910995  
34897       0           
34897       0.02743455  
34897       0.18890052  
34897       0           

where the probabilities in probability sum to 1 for each Group_ID. I want to create a new column called top 3 which indicates the position of the largest 3 probabilities for each Group_ID and top3 = 1 if the row has the greatest probability for that Group_ID and zero otherwise. Hence the outcome looks like:

Group_ID    probability top3
34883       0.002676    0
34883       0.17826266  1
34883       0.01399753  0
34883       0.04569782  0
34883       0.02799506  0
34883       0.02634829  0
34883       0.02923014  0
34883       0.13544669  0
34883       0.07595718  0
34883       0.19246604  1
34883       0.20028818  1
34883       0           0
34883       0           0
34883       0.07163442  0
34897       0.03329843  0
34897       0.07643979  0
34897       0.09570681  0
34897       0.00376963  0
34897       0.01780105  0
34897       0.0008377   0
34897       0.08125654  0
34897       0.10764398  0
34897       0.25780105  1
34897       0.10910995  1
34897       0           0
34897       0.02743455  0
34897       0.18890052  1
34897       0           0

I googled for a bit and I think it might have something to do with idxmax but I am not sure how to proceed. Thank you in advance.


Solution

  • Use a groupy.rank in descending order, then keep the values ≤ (le) 3, finally convert the boolean Series to integer with astype:

    df['top3'] = (df.groupby('Group_ID')['probability']
                    .rank('dense', ascending=False).le(3).astype(int)
                 )
    

    Output:

        Group_ID  probability  top3
    0      34883     0.002676     0
    1      34883     0.178263     1
    2      34883     0.013998     0
    3      34883     0.045698     0
    4      34883     0.027995     0
    5      34883     0.026348     0
    6      34883     0.029230     0
    7      34883     0.135447     0
    8      34883     0.075957     0
    9      34883     0.192466     1
    10     34883     0.200288     1
    11     34883     0.000000     0
    12     34883     0.000000     0
    13     34883     0.071634     0
    14     34897     0.033298     0
    15     34897     0.076440     0
    16     34897     0.095707     0
    17     34897     0.003770     0
    18     34897     0.017801     0
    19     34897     0.000838     0
    20     34897     0.081257     0
    21     34897     0.107644     0
    22     34897     0.257801     1
    23     34897     0.109110     1
    24     34897     0.000000     0
    25     34897     0.027435     0
    26     34897     0.188901     1
    27     34897     0.000000     0