Search code examples
pythonpandasdataframesortingdrop-duplicates

pandas select rows with condition in priority order


I'm new to pandas. So I have dataframe that looks like that:

    id  car date    color
1   2   bmw 2021-05-21  black
2   3   bmw 2021-05-21  yellow
3   4   mercedes    2021-06-21  red
4   5   toyota  2021-11-01  pink
5   6   toyota  2021-09-06  black
6   7   mercedes    2021-07-07  white

I need to choose rows that have unique "car" with the latest date and color in priority (red, pink, black, white, yellow), by that i mean if car in dataframe sorted by date has color red, then i save that line, if car doesn't have red color then i search for pink color & etc.

so firstly - sort by date secondly - if date is duplicated, i sort it by needed color

i did sorting by date: df.sort_values(by="date").drop_duplicates(subset=["car", "color"], keep="last")

and it looks like that rn:

    id  car date    color
1   2   bmw 2021-05-21  black
2   3   bmw 2021-05-21  yellow
3   4   mercedes    2021-06-21  red
6   7   mercedes    2021-07-07  white
5   6   toyota  2021-09-06  black
4   5   toyota  2021-11-01  pink

what i actually wanna see:

id  car date    color
2   bmw 2021-05-21  black
5   toyota  2021-11-01  pink
7   mercedes    2021-07-07  white

Solution

  • IIUC, use pd.Categorical to define your order, then sort_values + groupby.

    df["color"] = pd.Categorical(
        df["color"], categories=["red", "pink", "black", "white", "yellow"], ordered=True
    )
    
    df.sort_values(by=["date", "color"], ascending=[0, 1]).groupby(
        "car", as_index=0
    ).first()
    

            car  id       date  color
    0       bmw   2 2021-05-21  black
    1  mercedes   7 2021-07-07  white
    2    toyota   5 2021-11-01   pink