Search code examples
pandasdataframeduplicates

Changing one value of duplicate row in pandas


I have a large(>2gb) dataset that need to be filtered for duplicates like this: if values in columns A,B,C,D and E are identical, then the value of E in every duplicate row (there can be multiple duplicates) has to be changed to 0.

Dataset:

Column A Column B Column C Column D Column E
a b c d e
a b c d e
a b c d e

Output:

Column A Column B Column C Column D Column E
a b c d e
a b c d 0
a b c d 0

Solution

  • If need test duplicates in all columns use DataFrame.duplicated with no parameter:

    df.loc[df.duplicated(), 'Column E'] = 0
    print (df)
      Column A ColumnB Column C Column D Column E
    0        a       b        c        d        e
    1        a       b        c        d        0
    2        a       b        c        d        0
    

    If need test duplicates in columns from list cols:

    cols = ['Column A', 'ColumnB', 'Column C', 'Column D', 'Column E']
    df.loc[df.duplicated(subset=cols), 'Column E'] = 0
    print (df)
      Column A ColumnB Column C Column D Column E
    0        a       b        c        d        e
    1        a       b        c        d        0
    2        a       b        c        d        0