Search code examples
pythonpandassubobject

Ungrouping data by indexing sub-objects and stacking in pandas


I have data that operates like it has been grouped. My goal is to keep only the most recent date for each name exclusively for location 'Home', in other words I want to know the most recent date that each person was 'Home'.

Dataframe df:

       ID           Date01     Location01 Date02     Location02  Date03     Location03
0      jane         1/01/2023  Home       2/24/2024  Home
1      sarah        2/03/2024  Home        
2      ricardo      2/05/2024  Home       4/28/2024  Office      5/03/2024  Home
3      thomas       3/23/2024  Home         
4      lee          4/01/2024  Home       5/07/2024  Satellite    
5      patrick      1/02/2024  Home
6      anya         3/05/2022  Office     4/19/2024  Home          

I want the output:

       ID           Date      Location
0      jane         1/01/2023 Home
1      jane         2/24/2024 Home  
2      sarah        2/03/2024 Home         
3      ricardo      2/05/2024 Home
4      ricardo      4/28/2024 Office   
5      ricardo      5/03/2024 Home
6      thomas       3/23/2024 Home          
7      lee          4/01/2024 Home   
8      lee          5/07/2024 Satellite     
9      patrick      1/02/2024 Home   
10     anya         3/05/2022 Office    
11     anya         4/19/2024 Home          

So that I can remove all that isn't Location 'Home', and then remove everything except the most recent date per ID, leaving the final output:

       ID           Date      Location
0      jane         2/24/2024 Home    
1      sarah        2/03/2024 Home              
2      ricardo      5/03/2024 Home
3      thomas       3/23/2024 Home              
4      lee          4/01/2024 Home     
5      patrick      1/02/2024 Home     
6      anya         4/19/2024 Home

I think maybe I need to set indexes for the sub-object Dates and stack but I'm pretty stuck. Does anyone have any ideas on where to begin with this?

Thank you!


Solution

  • There is an useful function in pandas 2.2.2 called lreshape for this.

    You can do:

    locs = df.filter(regex="^Location").columns.tolist()
    dates = df.filter(regex="^Date").columns.tolist()
    out = pd.lreshape(df, {'Location': locs, 'Date': dates})
    out = out[out["Location"].eq("Home")].groupby("ID", as_index=False).agg(max)
    print(out)
    
            ID Location       Date
    0     anya     Home  4/19/2024
    1     jane     Home  2/24/2024
    2      lee     Home  4/01/2024
    3  patrick     Home  1/02/2024
    4  ricardo     Home  5/03/2024
    5    sarah     Home  2/03/2024
    6   thomas     Home  3/23/2024