I have dataframe as below
data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]
df = pd.DataFrame(data, columns=['ename','score', 'groupid'])
df
I need to mark duplicate as 0 in new column but NOT the one with highest score. and should be grouping on groupid and ename.
I am looking to get output as below:
ename score groupid duplicate
a 96.21624 1 TRUE
a 99.882111 1 FALSE
b 99.902328 1 TRUE
b 99.912328 1 TRUE
b 99.919289 1 FALSE
c 99.891624 1 FALSE
d 99.952644 1 FALSE
a 99.828621 2 TRUE
a 99.844666 2 FALSE
b 99.896851 2 FALSE
c 78.106148 2 TRUE
c 97.733055 2 FALSE
d 95.423836 2 FALSE
Here's how I would do it:
df['duplicate'] = True
df.loc[df.groupby(['groupid', 'ename'])['score'].idxmax().values, 'duplicate'] = False
print(df)
Prints:
ename score groupid duplicate
0 a 96.216240 1 True
1 a 99.882111 1 False
2 b 99.902328 1 True
3 b 99.912328 1 True
4 b 99.919289 1 False
5 c 99.891624 1 False
6 d 99.952644 1 False
7 a 99.828621 2 True
8 a 99.844666 2 False
9 b 99.896851 2 False
10 c 78.106148 2 True
11 c 97.733055 2 False
12 d 95.423836 2 False