Search code examples
pythonpandasgroup-byduplicates

Pandas - drop records if on specific column value is equal to previous record


I have a dataframe like following (sorted by id, time and status):

id status timestamp
111 A 29.08.2023 12:39
111 A 29.08.2023 12:45
111 B 29.08.2023 12:47
111 C 29.08.2023 12:50
111 A 29.08.2023 12:50
112 A 29.08.2023 12:50
112 B 29.08.2023 13:09
112 C 29.08.2023 13:40
112 B 29.08.2023 13:50
112 A 29.08.2023 13:55

I need to remove remove repeats in status sequences but only for cases when status repeats in next to each other records. Meaning that id 111 can have multiple statuses == 'A', but I need to remove second 'A' if previous status in time also was 'A'. So new table should be like this (remove second line with same status):

id status timestamp
111 A 29.08.2023 12:39
111 B 29.08.2023 12:47
111 C 29.08.2023 12:50
111 A 29.08.2023 12:50
112 A 29.08.2023 12:50
112 B 29.08.2023 13:09
112 C 29.08.2023 13:40
112 B 29.08.2023 13:50
112 A 29.08.2023 13:55

So in the end we have a unique sequences of statuses for each id without repeats.

Appreciate any help because I stuck with some very slow and blunt approach comparing each row with previous, but dataset is very huge, > 5M records.


Solution

  • Get difference of multiple columns shifted values and get rows by boolean indexing:

    out = df[df[['id','status']].ne(df[['id','status']].shift()).any(axis=1)]
    print (out)
        id status         timestamp
    0  111      A  29.08.2023 12:39
    2  111      B  29.08.2023 12:47
    3  111      C  29.08.2023 12:50
    4  111      A  29.08.2023 12:50
    5  112      A  29.08.2023 12:50
    6  112      B  29.08.2023 13:09
    7  112      C  29.08.2023 13:40
    8  112      B  29.08.2023 13:50
    9  112      A  29.08.2023 13:55