I have a dataframe like following (sorted by id, time and status):
id | status | timestamp |
---|---|---|
111 | A | 29.08.2023 12:39 |
111 | A | 29.08.2023 12:45 |
111 | B | 29.08.2023 12:47 |
111 | C | 29.08.2023 12:50 |
111 | A | 29.08.2023 12:50 |
112 | A | 29.08.2023 12:50 |
112 | B | 29.08.2023 13:09 |
112 | C | 29.08.2023 13:40 |
112 | B | 29.08.2023 13:50 |
112 | A | 29.08.2023 13:55 |
I need to remove remove repeats in status sequences but only for cases when status repeats in next to each other records. Meaning that id 111 can have multiple statuses == 'A', but I need to remove second 'A' if previous status in time also was 'A'. So new table should be like this (remove second line with same status):
id | status | timestamp |
---|---|---|
111 | A | 29.08.2023 12:39 |
111 | B | 29.08.2023 12:47 |
111 | C | 29.08.2023 12:50 |
111 | A | 29.08.2023 12:50 |
112 | A | 29.08.2023 12:50 |
112 | B | 29.08.2023 13:09 |
112 | C | 29.08.2023 13:40 |
112 | B | 29.08.2023 13:50 |
112 | A | 29.08.2023 13:55 |
So in the end we have a unique sequences of statuses for each id without repeats.
Appreciate any help because I stuck with some very slow and blunt approach comparing each row with previous, but dataset is very huge, > 5M records.
Get difference of multiple columns shifted values and get rows by boolean indexing
:
out = df[df[['id','status']].ne(df[['id','status']].shift()).any(axis=1)]
print (out)
id status timestamp
0 111 A 29.08.2023 12:39
2 111 B 29.08.2023 12:47
3 111 C 29.08.2023 12:50
4 111 A 29.08.2023 12:50
5 112 A 29.08.2023 12:50
6 112 B 29.08.2023 13:09
7 112 C 29.08.2023 13:40
8 112 B 29.08.2023 13:50
9 112 A 29.08.2023 13:55