I have data that operates like it has been grouped. My goal is to keep only the most recent date for each name exclusively for location 'Home', in other words I want to know the most recent date that each person was 'Home'.
Dataframe df:
ID Date01 Location01 Date02 Location02 Date03 Location03
0 jane 1/01/2023 Home 2/24/2024 Home
1 sarah 2/03/2024 Home
2 ricardo 2/05/2024 Home 4/28/2024 Office 5/03/2024 Home
3 thomas 3/23/2024 Home
4 lee 4/01/2024 Home 5/07/2024 Satellite
5 patrick 1/02/2024 Home
6 anya 3/05/2022 Office 4/19/2024 Home
I want the output:
ID Date Location
0 jane 1/01/2023 Home
1 jane 2/24/2024 Home
2 sarah 2/03/2024 Home
3 ricardo 2/05/2024 Home
4 ricardo 4/28/2024 Office
5 ricardo 5/03/2024 Home
6 thomas 3/23/2024 Home
7 lee 4/01/2024 Home
8 lee 5/07/2024 Satellite
9 patrick 1/02/2024 Home
10 anya 3/05/2022 Office
11 anya 4/19/2024 Home
So that I can remove all that isn't Location 'Home', and then remove everything except the most recent date per ID, leaving the final output:
ID Date Location
0 jane 2/24/2024 Home
1 sarah 2/03/2024 Home
2 ricardo 5/03/2024 Home
3 thomas 3/23/2024 Home
4 lee 4/01/2024 Home
5 patrick 1/02/2024 Home
6 anya 4/19/2024 Home
I think maybe I need to set indexes for the sub-object Dates and stack but I'm pretty stuck. Does anyone have any ideas on where to begin with this?
Thank you!
There is an useful function in pandas 2.2.2 called lreshape
for this.
You can do:
locs = df.filter(regex="^Location").columns.tolist()
dates = df.filter(regex="^Date").columns.tolist()
out = pd.lreshape(df, {'Location': locs, 'Date': dates})
out = out[out["Location"].eq("Home")].groupby("ID", as_index=False).agg(max)
print(out)
ID Location Date
0 anya Home 4/19/2024
1 jane Home 2/24/2024
2 lee Home 4/01/2024
3 patrick Home 1/02/2024
4 ricardo Home 5/03/2024
5 sarah Home 2/03/2024
6 thomas Home 3/23/2024