python, sorting, pandas-groupby, data-manipulation, ranking

How to obtain the unique record of a cross joined table based on the dates of two different columns?


I have quite a complex piece of logic to create. I have some client clinic encounter data with historical testing results (R_DATE_TESTED, R_RESULT) mapped to each client (P_CLIENT_ID) for each encounter date (P_DATE_ENCOUNTER).

RECORD_ID P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
302950 25835 25835.0 2016-12-21 2017-03-07 20.0
302951 25835 25835.0 2016-12-21 2017-08-03 20.0
302952 25835 25835.0 2016-12-21 2018-03-23 20.0
302953 25835 25835.0 2016-12-21 2019-06-28 20.0
302954 25835 25835.0 2016-12-21 2019-08-19 42.0
302955 25835 25835.0 2016-12-21 2020-04-20 40.0
302956 25835 25835.0 2016-12-21 2021-06-03 20.0
302957 25835 25835.0 2017-02-21 2017-03-07 20.0
302958 25835 25835.0 2017-02-21 2017-08-03 20.0
302959 25835 25835.0 2017-02-21 2018-03-23 20.0
302960 25835 25835.0 2017-02-21 2019-06-28 20.0
302961 25835 25835.0 2017-02-21 2019-08-19 42.0
302962 25835 25835.0 2017-02-21 2020-04-20 40.0
302963 25835 25835.0 2017-02-21 2021-06-03 20.0
302964 25835 25835.0 2017-04-25 2017-03-07 20.0
302965 25835 25835.0 2017-04-25 2017-08-03 20.0
302966 25835 25835.0 2017-04-25 2018-03-23 20.0
302967 25835 25835.0 2017-04-25 2019-06-28 20.0
302968 25835 25835.0 2017-04-25 2019-08-19 42.0
302969 25835 25835.0 2017-04-25 2020-04-20 40.0
302970 25835 25835.0 2017-04-25 2021-06-03 20.0
302971 25835 25835.0 2017-06-21 2017-03-07 20.0
302972 25835 25835.0 2017-06-21 2017-08-03 20.0
302973 25835 25835.0 2017-06-21 2018-03-23 20.0
302974 25835 25835.0 2017-06-21 2019-06-28 20.0
302975 25835 25835.0 2017-06-21 2019-08-19 42.0
302976 25835 25835.0 2017-06-21 2020-04-20 40.0
302977 25835 25835.0 2017-06-21 2021-06-03 20.0
302978 25835 25835.0 2017-09-04 2017-03-07 20.0
302979 25835 25835.0 2017-09-04 2017-08-03 20.0
302980 25835 25835.0 2017-09-04 2018-03-23 20.0
302981 25835 25835.0 2017-09-04 2019-06-28 20.0
302982 25835 25835.0 2017-09-04 2019-08-19 42.0
302983 25835 25835.0 2017-09-04 2020-04-20 40.0
302984 25835 25835.0 2017-09-04 2021-06-03 20.0
302985 25835 25835.0 2018-01-08 2017-03-07 20.0
302986 25835 25835.0 2018-01-08 2017-08-03 20.0
302987 25835 25835.0 2018-01-08 2018-03-23 20.0
302988 25835 25835.0 2018-01-08 2019-06-28 20.0
302989 25835 25835.0 2018-01-08 2019-08-19 42.0
302990 25835 25835.0 2018-01-08 2020-04-20 40.0
302991 25835 25835.0 2018-01-08 2021-06-03 20.0
302992 25835 25835.0 2018-04-03 2017-03-07 20.0
302993 25835 25835.0 2018-04-03 2017-08-03 20.0
302994 25835 25835.0 2018-04-03 2018-03-23 20.0
302995 25835 25835.0 2018-04-03 2019-06-28 20.0
302996 25835 25835.0 2018-04-03 2019-08-19 42.0
302997 25835 25835.0 2018-04-03 2020-04-20 40.0
302998 25835 25835.0 2018-04-03 2021-06-03 20.0
302999 25835 25835.0 2018-07-25 2017-03-07 20.0
303000 25835 25835.0 2018-07-25 2017-08-03 20.0
303001 25835 25835.0 2018-07-25 2018-03-23 20.0
303002 25835 25835.0 2018-07-25 2019-06-28 20.0
303003 25835 25835.0 2018-07-25 2019-08-19 42.0
303004 25835 25835.0 2018-07-25 2020-04-20 40.0
303005 25835 25835.0 2018-07-25 2021-06-03 20.0

The data is already sorted. How can I obtain, for each client encounter (grouping by P_CLIENT_ID and P_DATE_ENCOUNTER), the single record where R_DATE_TESTED < P_DATE_ENCOUNTER and R_DATE_TESTED is the most recent such date? If no row satisfies R_DATE_TESTED < P_DATE_ENCOUNTER, the result should contain nulls for those columns.

The result of this logic should be as below:

P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
25835 25835.0 2016-12-21 NaN NaN
25835 25835.0 2017-02-21 NaN NaN
25835 25835.0 2017-04-25 2017-03-07 20.0
25835 25835.0 2017-06-21 2017-03-07 20.0
25835 25835.0 2017-09-04 2017-08-03 20.0
25835 25835.0 2018-01-08 2017-08-03 20.0
25835 25835.0 2018-04-03 2018-03-23 20.0

The idea is that for each P_CLIENT_ID, each P_DATE_ENCOUNTER is returned with its most recent previous R_RESULT (the latest result prior to the encounter). If a client has no results before the P_DATE_ENCOUNTER, i.e. no row with R_DATE_TESTED < P_DATE_ENCOUNTER, then nulls are returned for those columns (as can be seen in the first two records). I thought maybe a combination of some ranking over the partition and .ffill() might work, but I am really stuck. The selection step I have in mind is sketched just below.
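To make the rule concrete, here is a rough sketch of the selection step only (a hypothetical illustration, assuming df holds the table above and both date columns have been parsed with pd.to_datetime):

    import pandas as pd

    # Keep only tests taken strictly before the encounter, then pick the latest
    # one per (P_CLIENT_ID, P_DATE_ENCOUNTER) group.
    prior = df[df['R_DATE_TESTED'] < df['P_DATE_ENCOUNTER']]
    latest = prior.loc[
        prior.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])['R_DATE_TESTED'].idxmax()
    ]
    # ...but this drops the encounters with no earlier test entirely, whereas
    # they should come back as all-NaN rows.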


Solution

  • You can use this code:

    import numpy as np
    import pandas as pd

    # df - your DataFrame; the date columns must be comparable
    # (datetimes or ISO-formatted strings)

    group = df.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])

    def foo(df):
        # Keep only tests taken before the encounter; the data is already sorted
        # by R_DATE_TESTED, so the last remaining row is the most recent prior test.
        result = df.loc[df.P_DATE_ENCOUNTER > df.R_DATE_TESTED,
                        ['R_DATE_TESTED', 'R_RESULT']].tail(1).reset_index()
        if not result.empty:
            return result
        else:
            # No test before this encounter: return a single all-NaN row.
            # (This assumes RECORD_ID is the DataFrame's index, so that
            # reset_index() above yields a matching RECORD_ID column.)
            return pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['RECORD_ID', 'R_DATE_TESTED', 'R_RESULT'])

    group.apply(foo)
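
  • Alternatively, pd.merge_asof can express "latest test strictly before the
    encounter" directly. This is only a sketch, assuming the date columns parse
    cleanly with pd.to_datetime; encounters with no earlier test come back as
    NaT/NaN automatically.

    import pandas as pd

    # De-duplicate encounters and tests; merge_asof requires both frames to be
    # sorted on their date keys, so convert the dates and sort.
    encounters = (df[['P_CLIENT_ID', 'R_CLIENT_ID', 'P_DATE_ENCOUNTER']]
                  .drop_duplicates()
                  .assign(P_DATE_ENCOUNTER=lambda d: pd.to_datetime(d['P_DATE_ENCOUNTER']))
                  .sort_values('P_DATE_ENCOUNTER'))
    tests = (df[['P_CLIENT_ID', 'R_DATE_TESTED', 'R_RESULT']]
             .drop_duplicates()
             .assign(R_DATE_TESTED=lambda d: pd.to_datetime(d['R_DATE_TESTED']))
             .sort_values('R_DATE_TESTED'))

    out = pd.merge_asof(
        encounters, tests,
        left_on='P_DATE_ENCOUNTER', right_on='R_DATE_TESTED',
        by='P_CLIENT_ID',
        direction='backward',        # latest test at or before the encounter...
        allow_exact_matches=False,   # ...but strictly before it
    )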