I'm new to pandas. So I have dataframe that looks like that:
id car date color
1 2 bmw 2021-05-21 black
2 3 bmw 2021-05-21 yellow
3 4 mercedes 2021-06-21 red
4 5 toyota 2021-11-01 pink
5 6 toyota 2021-09-06 black
6 7 mercedes 2021-07-07 white
I need to choose rows that have unique "car" with the latest date and color in priority (red, pink, black, white, yellow), by that i mean if car in dataframe sorted by date has color red, then i save that line, if car doesn't have red color then i search for pink color & etc.
so firstly - sort by date secondly - if date is duplicated, i sort it by needed color
i did sorting by date:
df.sort_values(by="date").drop_duplicates(subset=["car", "color"], keep="last")
and it looks like that rn:
id car date color
1 2 bmw 2021-05-21 black
2 3 bmw 2021-05-21 yellow
3 4 mercedes 2021-06-21 red
6 7 mercedes 2021-07-07 white
5 6 toyota 2021-09-06 black
4 5 toyota 2021-11-01 pink
what i actually wanna see:
id car date color
2 bmw 2021-05-21 black
5 toyota 2021-11-01 pink
7 mercedes 2021-07-07 white
IIUC, use pd.Categorical
to define your order, then sort_values
+ groupby
.
df["color"] = pd.Categorical(
df["color"], categories=["red", "pink", "black", "white", "yellow"], ordered=True
)
df.sort_values(by=["date", "color"], ascending=[0, 1]).groupby(
"car", as_index=0
).first()
car id date color
0 bmw 2 2021-05-21 black
1 mercedes 7 2021-07-07 white
2 toyota 5 2021-11-01 pink