Search code examples
pythonpandassortinggroup-bydrop-duplicates

How to group by first column, select latest value of second column, and all respective values of third column


I have a df:

    {'ID': {0: 'A',
  1: 'A',
  2: 'A',
  3: 'B',
  4: 'B',
  5: 'B',
  6: 'C',
  7: 'C',
  8: 'C',
  9: 'C'},
 'Date': {0: Timestamp('2020-03-02 00:00:00'),
  1: Timestamp('2021-04-03 00:00:00'),
  2: Timestamp('2021-04-03 00:00:00'),
  3: Timestamp('2022-06-05 00:00:00'),
  4: Timestamp('2022-06-05 00:00:00'),
  5: Timestamp('2021-04-03 00:00:00'),
  6: Timestamp('2019-03-02 00:00:00'),
  7: Timestamp('2019-03-02 00:00:00'),
  8: Timestamp('2017-06-01 00:00:00'),
  9: Timestamp('2017-06-01 00:00:00')},
 'Check': {0: 'In',
  1: 'In',
  2: 'Out',
  3: 'Out',
  4: 'Out',
  5: 'In',
  6: 'In ',
  7: 'Out',
  8: 'Out',
  9: 'Out'}}

For each ID, I want to get the latest Date and all values of Check related to that latest date.
My expected result:

    {'ID': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C'},
 'Date': {0: Timestamp('2021-04-03 00:00:00'),
  1: Timestamp('2021-04-03 00:00:00'),
  2: Timestamp('2022-06-05 00:00:00'),
  3: Timestamp('2019-03-02 00:00:00'),
  4: Timestamp('2019-03-02 00:00:00')},
 'Check': {0: 'In', 1: 'Out', 2: 'Out', 3: 'In ', 4: 'Out'}}

I tried the below script and got the latest date with only first value of Check for each ID:

new_df= (
df.sort_values(["ID", "Date"], ascending=[False, False])
.drop_duplicates(subset = ["ID"])
.reset_index(drop=True)
) 

 new_df

    ID  Date        Check
0   C   2019-03-02  In
1   B   2022-06-05  Out
2   A   2021-04-03  In

Any suggestions?


Solution

  • one option is with a groupby:

    (df1
    .drop_duplicates()
    .merge(
        df1.groupby('ID').Date.max(), 
        on = ['ID', 'Date'])
    )
      ID       Date Check
    0  A 2021-04-03    In
    1  A 2021-04-03   Out
    2  B 2022-06-05   Out
    3  C 2019-03-02   In
    4  C 2019-03-02   Out