Search code examples
pythonpython-3.xdataframegroup-by

mark duplicate as 0 in new column based on condition


I have dataframe as below

data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]

df = pd.DataFrame(data, columns=['ename','score', 'groupid'])
df

I need to mark duplicate as 0 in new column but NOT the one with highest score. and should be grouping on groupid and ename.

I am looking to get output as below:

ename   score   groupid duplicate
    a   96.21624    1   TRUE
    a   99.882111   1   FALSE
    b   99.902328   1   TRUE
    b   99.912328   1   TRUE
    b   99.919289   1   FALSE
    c   99.891624   1   FALSE
    d   99.952644   1   FALSE
    a   99.828621   2   TRUE
    a   99.844666   2   FALSE
    b   99.896851   2   FALSE
    c   78.106148   2   TRUE
    c   97.733055   2   FALSE
    d   95.423836   2   FALSE

Solution

  • Here's how I would do it:

    df['duplicate'] = True
    df.loc[df.groupby(['groupid', 'ename'])['score'].idxmax().values, 'duplicate'] = False
    
    print(df)
    

    Prints:

       ename      score  groupid  duplicate
    0      a  96.216240        1       True
    1      a  99.882111        1      False
    2      b  99.902328        1       True
    3      b  99.912328        1       True
    4      b  99.919289        1      False
    5      c  99.891624        1      False
    6      d  99.952644        1      False
    7      a  99.828621        2       True
    8      a  99.844666        2      False
    9      b  99.896851        2      False
    10     c  78.106148        2       True
    11     c  97.733055        2      False
    12     d  95.423836        2      False