Search code examples
pythonpandasdataframedata-sciencedata-analysis

pd.Grouper() when applied on datetime, changes the original column of dates


I have a sample dataframe from my huge dataframe as shown given below.

import pandas as pd
import numpy as np

NaN = np.nan
data = {
    'ID':['AAQRB','AAQRB','AAQRB', 
'AHXSJ','AHXSJ','AHXSJ','GABOY','GABOY','GABOY','GHZGS','GHZGS','GHZGS'],
    'Date':['10/18/2021  10:52:53 PM','10/18/2021  10:53:55 PM', '10/25/2021  5:55:43 PM',
           '10/22/2021  10:37:06 PM','10/22/2021  10:38:22 PM','10/22/2021  10:39:56 PM',
           '11/1/2021  1:27:15 AM','11/1/2021  1:28:45 AM','11/2/2021  8:53:39 PM',
           '10/29/2021  11:13:57 PM', '10/29/2021  11:17:47 PM', '10/29/2021  11:19:15 PM'], 
    'Race_x':[NaN,NaN,NaN,NaN,NaN,1,NaN,NaN,1, NaN,NaN,1],
    'Vaccine':['TRUE',NaN,NaN,'TRUE',NaN,NaN,'TRUE',NaN,NaN,'FALSE',NaN,NaN],
    'Study_activity': 
   [NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN],
    'Who_Contacted': 
   [NaN,NaN,'WeContacted',NaN,NaN,'WeContacted',NaN,NaN,NaN,NaN,NaN,'WeContacted']}

test_df = pd.DataFrame(data)

Goal is to get all the first values for each ID and filter the several rows of participant to a single row with all information. The final dataframe should look like the image given below.

enter image description here

CODE TRIED

I tried using the Grouper() function, and the code is given below.

test_df['Date'] = pd.to_datetime(test_df['Date'])

test_df1 = (test_df.groupby(['ID', pd.Grouper(key='Date', freq='D')])
   .agg("first")
   .reset_index())
 baseline_df = test_df1[~test_df1.duplicated(subset = ['ID'], keep='first')]

But the problem with this is, if I use freq='D', then the Race_x values are missed which are entered the next day. the output looks like the image shown below.

enter image description here

If I use freq='M' or freq='Y', the other values are captured, however the Dates column values are changed, and we get the date of end of month for each ID as shown below.

enter image description here

The final 'Date' column should be the first entry of the 'date' for each ID and it should not change.

Any help is greatly appreciated. Thank you!


Solution

  • Create a virtual column to group by month:

    >>> test_df.assign(month=test_df['Date'].dt.strftime('%Y-%m')) \
               .groupby(['ID', 'month']).agg('first') \
               .droplevel(1).reset_index() \
               .assign(Date=lambda x: x['Date'].dt.date)
    
          ID        Date  Race_x Vaccine Study_activity Who_Contacted
    0  AAQRB  2021-10-18     NaN    TRUE       continue   WeContacted
    1  AHXSJ  2021-10-22     1.0    TRUE       continue   WeContacted
    2  GABOY  2021-11-01     1.0    TRUE       continue          None
    3  GHZGS  2021-10-29     1.0   FALSE       continue   WeContacted