Tags: python, pandas, dataframe, group-by

Get the minimum value greater than a given value from a column in a pandas DataFrame


I have a pandas DataFrame and would like to create a column min_date_after_ref_date that shows, for each id, the earliest time_stamp after a given ref_date. I have the following code.

from datetime import datetime
import pandas as pd

ref_date = datetime.strptime('2023-04-21 12:00', '%Y-%m-%d %H:%M')
df = pd.DataFrame({'id':[1,2,1,1,3], 'time_stamp': ['2023-04-19 12:05', '2023-04-21 12:45',
                                                 '2023-04-21 15:45', '2023-04-23 13:15', '2023-04-18 12:05']})
df = df.assign(time_stamp=pd.to_datetime(df.time_stamp))
# Filtering before groupby drops the rows at or before ref_date, so they get NaT on assignment
df = df.assign(min_date_after_ref_date=df[df.time_stamp>ref_date].groupby('id').time_stamp.transform('min'))

and I get this

   id          time_stamp min_date_after_ref_date
0   1 2023-04-19 12:05:00                     NaT
1   2 2023-04-21 12:45:00     2023-04-21 12:45:00
2   1 2023-04-21 15:45:00     2023-04-21 15:45:00
3   1 2023-04-23 13:15:00     2023-04-21 15:45:00
4   3 2023-04-18 12:05:00                     NaT

but I would like the first row to contain 2023-04-21 15:45:00 as well (and not NaT), so that every row with the same id has the same value of min_date_after_ref_date. How can I change this?


Solution

  • Use Series.where to set the timestamps to NaT where the condition is False, then group by id:

    out = df.assign(min_date_after_ref_date=df.time_stamp.where(df.time_stamp>ref_date)
                                               .groupby(df['id'])
                                               .transform('min'))
    print (out)
       id          time_stamp min_date_after_ref_date
    0   1 2023-04-19 12:05:00     2023-04-21 15:45:00
    1   2 2023-04-21 12:45:00     2023-04-21 12:45:00
    2   1 2023-04-21 15:45:00     2023-04-21 15:45:00
    3   1 2023-04-23 13:15:00     2023-04-21 15:45:00
    4   3 2023-04-18 12:05:00                     NaT
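
To see why masking behaves differently from filtering, it may help to look at the intermediate objects. The following is a minimal sketch that rebuilds the question's data (using pd.Timestamp for ref_date instead of datetime.strptime, which is an assumption made for brevity) and compares the two:

```python
import pandas as pd

ref_date = pd.Timestamp('2023-04-21 12:00')
df = pd.DataFrame({
    'id': [1, 2, 1, 1, 3],
    'time_stamp': pd.to_datetime([
        '2023-04-19 12:05', '2023-04-21 12:45', '2023-04-21 15:45',
        '2023-04-23 13:15', '2023-04-18 12:05',
    ]),
})

# Boolean filtering drops rows, so only index labels 1, 2 and 3 survive.
filtered = df[df.time_stamp > ref_date]

# where() keeps all five index labels and masks non-matching rows with NaT.
masked = df.time_stamp.where(df.time_stamp > ref_date)

# 'min' skips NaT within each group, so every row of a group gets the same
# value; a group with no timestamp after ref_date stays NaT.
result = masked.groupby(df['id']).transform('min')
```

Because masked still has one entry per original row, the transformed result aligns with every row of df, including the rows at or before ref_date.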
    

    Or compute the minimum per id on the filtered rows and map it back onto the id column with Series.map, without transform:

    out = df.assign(min_date_after_ref_date=df['id'].map(df[df.time_stamp>ref_date]
                                                         .groupby('id').time_stamp.min()))
    print (out)
       id          time_stamp min_date_after_ref_date
    0   1 2023-04-19 12:05:00     2023-04-21 15:45:00
    1   2 2023-04-21 12:45:00     2023-04-21 12:45:00
    2   1 2023-04-21 15:45:00     2023-04-21 15:45:00
    3   1 2023-04-23 13:15:00     2023-04-21 15:45:00
    4   3 2023-04-18 12:05:00                     NaT
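
If the same calculation is needed for several reference dates, the Series.where approach could be wrapped in a small helper. This is only a sketch: the function name add_min_after and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

def add_min_after(df, ref_date, key='id', col='time_stamp',
                  out_col='min_date_after_ref_date'):
    """Hypothetical helper: per-group minimum of `col` strictly after `ref_date`."""
    # Mask instead of filter so the original index and row count are preserved.
    masked = df[col].where(df[col] > ref_date)
    # Broadcast each group's minimum back to every row of that group.
    return df.assign(**{out_col: masked.groupby(df[key]).transform('min')})

df = pd.DataFrame({
    'id': [1, 2, 1, 1, 3],
    'time_stamp': pd.to_datetime([
        '2023-04-19 12:05', '2023-04-21 12:45', '2023-04-21 15:45',
        '2023-04-23 13:15', '2023-04-18 12:05',
    ]),
})

# Same reference date as in the question.
out = add_min_after(df, pd.Timestamp('2023-04-21 12:00'))

# An earlier reference date, so every id has at least one later timestamp.
earlier = add_min_after(df, pd.Timestamp('2023-04-01 00:00'))
```

Since the helper uses assign, it returns a new DataFrame and leaves df unchanged.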