I have a large (>2 GB) dataset that needs to be filtered for duplicates like this: if the values in columns A, B, C, D and E are identical, then the value of E in every duplicate row (there can be multiple duplicates) has to be changed to 0.
Dataset:
Column A | Column B | Column C | Column D | Column E |
---|---|---|---|---|
a | b | c | d | e |
a | b | c | d | e |
a | b | c | d | e |
Output:
Column A | Column B | Column C | Column D | Column E |
---|---|---|---|---|
a | b | c | d | e |
a | b | c | d | 0 |
a | b | c | d | 0 |
If you need to test for duplicates across all columns, use DataFrame.duplicated with no parameters:
df.loc[df.duplicated(), 'Column E'] = 0
print(df)
  Column A Column B Column C Column D Column E
0 a b c d e
1 a b c d 0
2 a b c d 0
If you need to test for duplicates only in the columns from a list cols:
cols = ['Column A', 'Column B', 'Column C', 'Column D', 'Column E']
df.loc[df.duplicated(subset=cols), 'Column E'] = 0
print(df)
  Column A Column B Column C Column D Column E
0 a b c d e
1 a b c d 0
2 a b c d 0
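The `subset` parameter matters once the rows are not identical in every column. A sketch with hypothetical data where the third row differs only in Column E: excluding Column E from `subset` still flags that row as a duplicate, whereas `df.duplicated()` over all columns would not.

```python
import pandas as pd

# Hypothetical data: the third row differs only in Column E.
df = pd.DataFrame({
    'Column A': ['a', 'a', 'a'],
    'Column B': ['b', 'b', 'b'],
    'Column C': ['c', 'c', 'c'],
    'Column D': ['d', 'd', 'd'],
    'Column E': ['e', 'e', 'x'],
})

# With subset, only the listed columns decide what counts as a duplicate;
# Column E is excluded here, so the third row is still flagged.
cols = ['Column A', 'Column B', 'Column C', 'Column D']
df.loc[df.duplicated(subset=cols), 'Column E'] = 0
print(df['Column E'].tolist())  # ['e', 0, 0]
```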