python pandas dataframe drop-duplicates

I am trying to remove consecutive duplicate elements and keep the last value in a data frame using pandas


There are two columns in the data frame. I am trying to remove consecutive duplicate elements from column "a" (named 'Patch' in the code), together with the corresponding elements from column "b" (named 'Reward'), while keeping only the last element of each consecutive run.

import pandas as pd
a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
df = df.drop_duplicates(subset='Patch', keep="last")
df = df.set_index('Patch')
print(df)

When I run this, I get:

       Reward
Patch        
6          54
7          12
5          75
4          55
1          87
8          46
9          98

However, what I want is:

Patch      Reward
5           45
6           54
7           12
5           75
4           55
1           87
8           46
9           98

PS: I don't want a value to be removed when it reappears later in the series after a different element. Only consecutive duplicates should be removed, keeping the last row of each consecutive run. I also don't want the result to be sorted; the rows should appear in the same order as in the list.


Solution

  • You can create a new column that assigns an id to each run of consecutive equal elements, then group by that id and take the last row of each group with the last aggregation.

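    import pandas as pd

    a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
    b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
    df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
    # A new group starts wherever Patch differs from the previous row
    df["group_id"] = (df.Patch != df.Patch.shift()).cumsum()
    # Keep the last row of each consecutive group; group_id becomes the index
    df = df.groupby("group_id").last()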
    

    Output (the group_id index is not shown)

    Patch  Reward 
    5      45
    6      54
    7      12
    5      75
    4      55
    1      87
    8      46
    9      98
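
As an alternative to the helper column above, the same rows can be selected with a boolean mask. This is only a sketch, assuming the 'Patch' column contains no NaN values (consecutive NaNs would not be treated as duplicates here). The names is_last_of_run and result are illustrative and not part of the original answer.

```python
import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame({"Patch": a, "Reward": b})

# A row is the last of its run when the next Patch value differs.
# shift(-1) puts NaN in the final row, and NaN != value is True,
# so the final row is always kept.
is_last_of_run = df["Patch"] != df["Patch"].shift(-1)

# Filtering with the mask preserves the original order; reset_index
# renumbers the surviving rows from 0.
result = df[is_last_of_run].reset_index(drop=True)
print(result)
```

For the sample data this selects the same eight rows as the groupby approach, in the original order, without adding a group_id column.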