I have quite a complex logic to create. I have some client clinic encounter data which has historical testing results, R_DATE_TESTED
, R_RESULT
mapped to each client (P_CLIENT_ID
) for each P_DATE_ENCOUNTER
.
RECORD_ID | P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|---|
302950 | 25835 | 25835.0 | 2016-12-21 | 2017-03-07 | 20.0 |
302951 | 25835 | 25835.0 | 2016-12-21 | 2017-08-03 | 20.0 |
302952 | 25835 | 25835.0 | 2016-12-21 | 2018-03-23 | 20.0 |
302953 | 25835 | 25835.0 | 2016-12-21 | 2019-06-28 | 20.0 |
302954 | 25835 | 25835.0 | 2016-12-21 | 2019-08-19 | 42.0 |
302955 | 25835 | 25835.0 | 2016-12-21 | 2020-04-20 | 40.0 |
302956 | 25835 | 25835.0 | 2016-12-21 | 2021-06-03 | 20.0 |
302957 | 25835 | 25835.0 | 2017-02-21 | 2017-03-07 | 20.0 |
302958 | 25835 | 25835.0 | 2017-02-21 | 2017-08-03 | 20.0 |
302959 | 25835 | 25835.0 | 2017-02-21 | 2018-03-23 | 20.0 |
302960 | 25835 | 25835.0 | 2017-02-21 | 2019-06-28 | 20.0 |
302961 | 25835 | 25835.0 | 2017-02-21 | 2019-08-19 | 42.0 |
302962 | 25835 | 25835.0 | 2017-02-21 | 2020-04-20 | 40.0 |
302963 | 25835 | 25835.0 | 2017-02-21 | 2021-06-03 | 20.0 |
302964 | 25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
302965 | 25835 | 25835.0 | 2017-04-25 | 2017-08-03 | 20.0 |
302966 | 25835 | 25835.0 | 2017-04-25 | 2018-03-23 | 20.0 |
302967 | 25835 | 25835.0 | 2017-04-25 | 2019-06-28 | 20.0 |
302968 | 25835 | 25835.0 | 2017-04-25 | 2019-08-19 | 42.0 |
302969 | 25835 | 25835.0 | 2017-04-25 | 2020-04-20 | 40.0 |
302970 | 25835 | 25835.0 | 2017-04-25 | 2021-06-03 | 20.0 |
302971 | 25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
302972 | 25835 | 25835.0 | 2017-06-21 | 2017-08-03 | 20.0 |
302973 | 25835 | 25835.0 | 2017-06-21 | 2018-03-23 | 20.0 |
302974 | 25835 | 25835.0 | 2017-06-21 | 2019-06-28 | 20.0 |
302975 | 25835 | 25835.0 | 2017-06-21 | 2019-08-19 | 42.0 |
302976 | 25835 | 25835.0 | 2017-06-21 | 2020-04-20 | 40.0 |
302977 | 25835 | 25835.0 | 2017-06-21 | 2021-06-03 | 20.0 |
302978 | 25835 | 25835.0 | 2017-09-04 | 2017-03-07 | 20.0 |
302979 | 25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
302980 | 25835 | 25835.0 | 2017-09-04 | 2018-03-23 | 20.0 |
302981 | 25835 | 25835.0 | 2017-09-04 | 2019-06-28 | 20.0 |
302982 | 25835 | 25835.0 | 2017-09-04 | 2019-08-19 | 42.0 |
302983 | 25835 | 25835.0 | 2017-09-04 | 2020-04-20 | 40.0 |
302984 | 25835 | 25835.0 | 2017-09-04 | 2021-06-03 | 20.0 |
302985 | 25835 | 25835.0 | 2018-01-08 | 2017-03-07 | 20.0 |
302986 | 25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
302987 | 25835 | 25835.0 | 2018-01-08 | 2018-03-23 | 20.0 |
302988 | 25835 | 25835.0 | 2018-01-08 | 2019-06-28 | 20.0 |
302989 | 25835 | 25835.0 | 2018-01-08 | 2019-08-19 | 42.0 |
302990 | 25835 | 25835.0 | 2018-01-08 | 2020-04-20 | 40.0 |
302991 | 25835 | 25835.0 | 2018-01-08 | 2021-06-03 | 20.0 |
302992 | 25835 | 25835.0 | 2018-04-03 | 2017-03-07 | 20.0 |
302993 | 25835 | 25835.0 | 2018-04-03 | 2017-08-03 | 20.0 |
302994 | 25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
302995 | 25835 | 25835.0 | 2018-04-03 | 2019-06-28 | 20.0 |
302996 | 25835 | 25835.0 | 2018-04-03 | 2019-08-19 | 42.0 |
302997 | 25835 | 25835.0 | 2018-04-03 | 2020-04-20 | 40.0 |
302998 | 25835 | 25835.0 | 2018-04-03 | 2021-06-03 | 20.0 |
302999 | 25835 | 25835.0 | 2018-07-25 | 2017-03-07 | 20.0 |
303000 | 25835 | 25835.0 | 2018-07-25 | 2017-08-03 | 20.0 |
303001 | 25835 | 25835.0 | 2018-07-25 | 2018-03-23 | 20.0 |
303002 | 25835 | 25835.0 | 2018-07-25 | 2019-06-28 | 20.0 |
303003 | 25835 | 25835.0 | 2018-07-25 | 2019-08-19 | 42.0 |
303004 | 25835 | 25835.0 | 2018-07-25 | 2020-04-20 | 40.0 |
303005 | 25835 | 25835.0 | 2018-07-25 | 2021-06-03 | 20.0 |
The data is already sorted. How can I obtain the unique record for each Client encounter (group P_CLIENT_ID AND P_DATE_ENCOUNTER
) where R_DATE_TESTED < R_DATE_ENCOUNTER
(but the most recent of that). Moreover if R_DATE_TESTED < R_DATE_ENCOUNTER
is not true; it returns nulls
The result from logic should be as below:
P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|
25835 | 25835.0 | 2016-12-21 | NaN | NaN |
25835 | 25835.0 | 2017-02-21 | NaN | NaN |
25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
The idea is that for each P_CLIENT_ID
, each P_ENCOUNTER_ID
is returned with it's most recent previous R_RESULT
(latest result prior to the encounter). If a CLIENT has no Results which are before the P_DATE_ENCOUNTER
, i.e. (R_DATE_TESTED
is not < P_DATE_ENCOUNTERED
) then it returns nulls for those columns (As can be seen for the first two records). I thought maybe a combinination of using some ranking over the partition and .ffill()
but really am stuck.
Yuo can use this code:
import numpy as np
# df - your DataFrame
group = df.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])
def foo(df):
result = df.loc[df.P_DATE_ENCOUNTER>df.R_DATE_TESTED, ['R_DATE_TESTED', 'R_RESULT']].tail(1).reset_index()
if not result.empty:
return result
else:
return pd.DataFrame([[np.nan, np.nan, np.nan]], columns=['RECORD_ID','R_DATE_TESTED', 'R_RESULT'])
group.apply(foo)