I have a sample of the dataframe as given below.
data = {'ID':['A', 'A', 'A', 'A', 'A', 'B','B','B','B'],
'Date':['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
'Modified_Date':['3/19/2021 6:34:20 PM','3/20/2021 4:57:39 PM', '3/21/2021 4:57:40 PM', '3/22/2021 4:57:57 PM', '3/23/2021 4:57:41 PM',
'3/25/2021 11:44:15 PM','3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM', '3/21/2021 4:57:40 PM'],
'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000,1000, 3500]}
df1 = pd.DataFrame(data)
df1
This data have to be filtered in such a way that first for 'ID', and then for each 'Date', the latest entry of 'Modified_Date' row has to be selected.
EX: For ID=A, For Date='2021-04-14', The latest/last modified date = '3/22/2021 4:57:57 PM', This row has to be selected.
I have attached the snippet of the how the final dataframe has to look like.
I have been stuck on this for a while.
The easiest/most straighforward is to sort by date and take the last per group:
(df1.sort_values(by='Modified_Date')
.groupby(['ID', 'Date'], as_index=False).last()
)
output:
ID Date Modified_Date Steps
0 A 2021-2-13 3/19/2021 6:34:20 PM 1000
1 A 2021-2-14 3/22/2021 4:57:57 PM 2000
2 A 2021-2-15 3/23/2021 4:57:41 PM 1400
3 B 2021-2-14 3/26/2021 2:16:09 PM 5000
4 B 2021-2-15 3/21/2021 4:57:40 PM 3500