Search code examples
pythondataframeduplicatesdrop-duplicates

How to remove duplicates and keep values of all columns


I have a df like below

  Date         ID Colour ColourCode    Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple      NaN
1 2020-01-02  245   Blue        NaN   Apple       Ap
2 2020-01-03  245   Blue         Bl  Orange      NaN
3 2020-01-03  245  Green        NaN   Apple       Ap
4 2020-01-05  472    Red         Re   Grape      NaN
5 2020-01-05  472    Red        NaN   Grape       Gr
6 2020-01-07  472    Red         Re  Banana      NaN

df.to_dict()
{'Date': {0: Timestamp('2020-01-02 00:00:00'), 1: Timestamp('2020-01-02 00:00:00'), 2: Timestamp('2020-01-03 00:00:00'), 3: Timestamp('2020-01-03 00:00:00'), 4: Timestamp('2020-01-05 00:00:00'), 5: Timestamp('2020-01-05 00:00:00'), 6: Timestamp('2020-01-07 00:00:00')}, 'ID': {0: 245, 1: 245, 2: 245, 3: 245, 4: 472, 5: 472, 6: 472}, 'Colour': {0: 'Blue', 1: 'Blue', 2: 'Blue', 3: 'Green', 4: 'Red', 5: 'Red', 6: 'Red'}, 'ColourCode': {0: 'Bl', 1: nan, 2: 'Bl', 3: nan, 4: 'Re', 5: nan, 6: 'Re'}, 'Item': {0: 'Apple', 1: 'Apple', 2: 'Orange', 3: 'Apple', 4: 'Grape', 5: 'Grape', 6: 'Banana'}, 'ItemCode': {0: nan, 1: 'Ap', 2: nan, 3: 'Ap', 4: nan, 5: 'Gr', 6: nan}}

Any rows having same Date, ID, Colour, Item would be considered duplicates. I just want to keep one row but this row would contain values of both columns ColourCode and ItemCode.
My expected result looks like

 Date          ID Colour ColourCode Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple       Ap
1 2020-01-03  245   Blue         Bl  Orange      NaN
2 2020-01-03  245  Green        NaN   Apple       Ap
3 2020-01-05  472    Red         Re   Grape       Gr
4 2020-01-07  472    Red         Re  Banana      NaN

Are there any ways of doing it in Python?


Solution

  • Assuming , you can use groupby.first and reindex:

    out = (df
     .groupby(['Date', 'ID', 'Colour', 'Item'], as_index=False)
     .first()[df.columns]
     )
    

    Output:

            Date   ID Colour ColourCode    Item ItemCode
    0 2020-01-02  245   Blue         Bl   Apple       Ap
    1 2020-01-03  245   Blue         Bl  Orange     None
    2 2020-01-03  245  Green       None   Apple       Ap
    3 2020-01-05  472    Red         Re   Grape       Gr
    4 2020-01-07  472    Red         Re  Banana     None