python, sorting, pandas-groupby, data-manipulation, ranking

How to obtain the unique record of a cross joined table based on the dates of two different columns?


I have quite a complex piece of logic to create. I have some client clinic encounter data with historical testing results (R_DATE_TESTED, R_RESULT) mapped to each client (P_CLIENT_ID) for each encounter date (P_DATE_ENCOUNTER).

RECORD_ID P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
302950 25835 25835.0 2016-12-21 2017-03-07 20.0
302951 25835 25835.0 2016-12-21 2017-08-03 20.0
302952 25835 25835.0 2016-12-21 2018-03-23 20.0
302953 25835 25835.0 2016-12-21 2019-06-28 20.0
302954 25835 25835.0 2016-12-21 2019-08-19 42.0
302955 25835 25835.0 2016-12-21 2020-04-20 40.0
302956 25835 25835.0 2016-12-21 2021-06-03 20.0
302957 25835 25835.0 2017-02-21 2017-03-07 20.0
302958 25835 25835.0 2017-02-21 2017-08-03 20.0
302959 25835 25835.0 2017-02-21 2018-03-23 20.0
302960 25835 25835.0 2017-02-21 2019-06-28 20.0
302961 25835 25835.0 2017-02-21 2019-08-19 42.0
302962 25835 25835.0 2017-02-21 2020-04-20 40.0
302963 25835 25835.0 2017-02-21 2021-06-03 20.0
302964 25835 25835.0 2017-04-25 2017-03-07 20.0
302965 25835 25835.0 2017-04-25 2017-08-03 20.0
302966 25835 25835.0 2017-04-25 2018-03-23 20.0
302967 25835 25835.0 2017-04-25 2019-06-28 20.0
302968 25835 25835.0 2017-04-25 2019-08-19 42.0
302969 25835 25835.0 2017-04-25 2020-04-20 40.0
302970 25835 25835.0 2017-04-25 2021-06-03 20.0
302971 25835 25835.0 2017-06-21 2017-03-07 20.0
302972 25835 25835.0 2017-06-21 2017-08-03 20.0
302973 25835 25835.0 2017-06-21 2018-03-23 20.0
302974 25835 25835.0 2017-06-21 2019-06-28 20.0
302975 25835 25835.0 2017-06-21 2019-08-19 42.0
302976 25835 25835.0 2017-06-21 2020-04-20 40.0
302977 25835 25835.0 2017-06-21 2021-06-03 20.0
302978 25835 25835.0 2017-09-04 2017-03-07 20.0
302979 25835 25835.0 2017-09-04 2017-08-03 20.0
302980 25835 25835.0 2017-09-04 2018-03-23 20.0
302981 25835 25835.0 2017-09-04 2019-06-28 20.0
302982 25835 25835.0 2017-09-04 2019-08-19 42.0
302983 25835 25835.0 2017-09-04 2020-04-20 40.0
302984 25835 25835.0 2017-09-04 2021-06-03 20.0
302985 25835 25835.0 2018-01-08 2017-03-07 20.0
302986 25835 25835.0 2018-01-08 2017-08-03 20.0
302987 25835 25835.0 2018-01-08 2018-03-23 20.0
302988 25835 25835.0 2018-01-08 2019-06-28 20.0
302989 25835 25835.0 2018-01-08 2019-08-19 42.0
302990 25835 25835.0 2018-01-08 2020-04-20 40.0
302991 25835 25835.0 2018-01-08 2021-06-03 20.0
302992 25835 25835.0 2018-04-03 2017-03-07 20.0
302993 25835 25835.0 2018-04-03 2017-08-03 20.0
302994 25835 25835.0 2018-04-03 2018-03-23 20.0
302995 25835 25835.0 2018-04-03 2019-06-28 20.0
302996 25835 25835.0 2018-04-03 2019-08-19 42.0
302997 25835 25835.0 2018-04-03 2020-04-20 40.0
302998 25835 25835.0 2018-04-03 2021-06-03 20.0
302999 25835 25835.0 2018-07-25 2017-03-07 20.0
303000 25835 25835.0 2018-07-25 2017-08-03 20.0
303001 25835 25835.0 2018-07-25 2018-03-23 20.0
303002 25835 25835.0 2018-07-25 2019-06-28 20.0
303003 25835 25835.0 2018-07-25 2019-08-19 42.0
303004 25835 25835.0 2018-07-25 2020-04-20 40.0
303005 25835 25835.0 2018-07-25 2021-06-03 20.0

The data is already sorted. How can I obtain, for each client encounter (grouping by P_CLIENT_ID and P_DATE_ENCOUNTER), the single record where R_DATE_TESTED < P_DATE_ENCOUNTER and R_DATE_TESTED is the most recent such date? If no row satisfies R_DATE_TESTED < P_DATE_ENCOUNTER, the result should contain nulls for those columns.

The result of this logic should be as below:

P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
25835 25835.0 2016-12-21 NaN NaN
25835 25835.0 2017-02-21 NaN NaN
25835 25835.0 2017-04-25 2017-03-07 20.0
25835 25835.0 2017-06-21 2017-03-07 20.0
25835 25835.0 2017-09-04 2017-08-03 20.0
25835 25835.0 2018-01-08 2017-08-03 20.0
25835 25835.0 2018-04-03 2018-03-23 20.0

The idea is that for each P_CLIENT_ID, each P_DATE_ENCOUNTER is returned with its most recent previous R_RESULT (the latest result prior to the encounter). If a client has no results before the P_DATE_ENCOUNTER, i.e. no row with R_DATE_TESTED < P_DATE_ENCOUNTER, then nulls are returned for those columns (as can be seen in the first two records). I thought maybe a combination of some ranking over the partition and .ffill() might work, but I am really stuck. The selection step I have in mind is sketched just below.
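To make the rule concrete, here is a rough sketch of the selection step only (a hypothetical illustration, assuming df holds the table above and both date columns have been parsed with pd.to_datetime):

    import pandas as pd

    # Keep only tests taken strictly before the encounter, then pick the latest
    # one per (P_CLIENT_ID, P_DATE_ENCOUNTER) group.
    prior = df[df['R_DATE_TESTED'] < df['P_DATE_ENCOUNTER']]
    latest = prior.loc[
        prior.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])['R_DATE_TESTED'].idxmax()
    ]
    # ...but this drops the encounters with no earlier test entirely, whereas
    # they should come back as all-NaN rows.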


Solution

  • You can use this code:

    import numpy as np
    import pandas as pd

    # df - your DataFrame; the date columns must be comparable
    # (datetimes or ISO-formatted strings)

    group = df.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])

    def foo(df):
        # Keep only tests taken before the encounter; the data is already sorted
        # by R_DATE_TESTED, so the last remaining row is the most recent prior test.
        result = df.loc[df.P_DATE_ENCOUNTER > df.R_DATE_TESTED,
                        ['R_DATE_TESTED', 'R_RESULT']].tail(1).reset_index()
        if not result.empty:
            return result
        else:
            # No test before this encounter: return a single all-NaN row.
            # (This assumes RECORD_ID is the DataFrame's index, so that
            # reset_index() above yields a matching RECORD_ID column.)
            return pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['RECORD_ID', 'R_DATE_TESTED', 'R_RESULT'])

    group.apply(foo)
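
  • Alternatively, pd.merge_asof can express "latest test strictly before the
    encounter" directly. This is only a sketch, assuming the date columns parse
    cleanly with pd.to_datetime; encounters with no earlier test come back as
    NaT/NaN automatically.

    import pandas as pd

    # De-duplicate encounters and tests; merge_asof requires both frames to be
    # sorted on their date keys, so convert the dates and sort.
    encounters = (df[['P_CLIENT_ID', 'R_CLIENT_ID', 'P_DATE_ENCOUNTER']]
                  .drop_duplicates()
                  .assign(P_DATE_ENCOUNTER=lambda d: pd.to_datetime(d['P_DATE_ENCOUNTER']))
                  .sort_values('P_DATE_ENCOUNTER'))
    tests = (df[['P_CLIENT_ID', 'R_DATE_TESTED', 'R_RESULT']]
             .drop_duplicates()
             .assign(R_DATE_TESTED=lambda d: pd.to_datetime(d['R_DATE_TESTED']))
             .sort_values('R_DATE_TESTED'))

    out = pd.merge_asof(
        encounters, tests,
        left_on='P_DATE_ENCOUNTER', right_on='R_DATE_TESTED',
        by='P_CLIENT_ID',
        direction='backward',        # latest test at or before the encounter...
        allow_exact_matches=False,   # ...but strictly before it
    )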