I have a sample dataframe from my huge dataframe as shown given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {
'ID':['AAQRB','AAQRB','AAQRB',
'AHXSJ','AHXSJ','AHXSJ','GABOY','GABOY','GABOY','GHZGS','GHZGS','GHZGS'],
'Date':['10/18/2021 10:52:53 PM','10/18/2021 10:53:55 PM', '10/25/2021 5:55:43 PM',
'10/22/2021 10:37:06 PM','10/22/2021 10:38:22 PM','10/22/2021 10:39:56 PM',
'11/1/2021 1:27:15 AM','11/1/2021 1:28:45 AM','11/2/2021 8:53:39 PM',
'10/29/2021 11:13:57 PM', '10/29/2021 11:17:47 PM', '10/29/2021 11:19:15 PM'],
'Race_x':[NaN,NaN,NaN,NaN,NaN,1,NaN,NaN,1, NaN,NaN,1],
'Vaccine':['TRUE',NaN,NaN,'TRUE',NaN,NaN,'TRUE',NaN,NaN,'FALSE',NaN,NaN],
'Study_activity':
[NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN],
'Who_Contacted':
[NaN,NaN,'WeContacted',NaN,NaN,'WeContacted',NaN,NaN,NaN,NaN,NaN,'WeContacted']}
test_df = pd.DataFrame(data)
Goal is to get all the first values for each ID and filter the several rows of participant to a single row with all information. The final dataframe should look like the image given below.
CODE TRIED
I tried using the Grouper() function, and the code is given below.
test_df['Date'] = pd.to_datetime(test_df['Date'])
test_df1 = (test_df.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg("first")
.reset_index())
baseline_df = test_df1[~test_df1.duplicated(subset = ['ID'], keep='first')]
But the problem with this is, if I use freq='D', then the Race_x values are missed which are entered the next day. the output looks like the image shown below.
If I use freq='M' or freq='Y', the other values are captured, however the Dates column values are changed, and we get the date of end of month for each ID as shown below.
The final 'Date' column should be the first entry of the 'date' for each ID and it should not change.
Any help is greatly appreciated. Thank you!
Create a virtual column to group by month:
>>> test_df.assign(month=test_df['Date'].dt.strftime('%Y-%m')) \
.groupby(['ID', 'month']).agg('first') \
.droplevel(1).reset_index() \
.assign(Date=lambda x: x['Date'].dt.date)
ID Date Race_x Vaccine Study_activity Who_Contacted
0 AAQRB 2021-10-18 NaN TRUE continue WeContacted
1 AHXSJ 2021-10-22 1.0 TRUE continue WeContacted
2 GABOY 2021-11-01 1.0 TRUE continue None
3 GHZGS 2021-10-29 1.0 FALSE continue WeContacted