I have a data set of patient observations (Obs) (e.g Blood pressure, Heart rate, Resp Rate etc (this is isn't exhaustive and can change to I would need to generate this list by pulling unique values from the Obs column).
Currently each row of data represents a value that corresponds to a time point that the recording should have been taken, the time the recording was taken and the patient visit ID.
I want to rearrange so that all the observations (the unique vales from the original Obs column) taken for each patient at each time point appear on one line. Sometimes two values are taken for an observation at a timepoints - in which case I want to take the later.Sometimes that observation isn't recorded at that time point in which case return NaN (or other Null value).
The datafile if 22GB so to do the as efficiently as possible would be appreciated!
Sample data:
Index, VisitID, Obs, Obs_DTM, Entered_DTM, Value
0, 33, BP, 2018-11-06 20:30:00, 2018-11-06 20:31:08, 120
1, 33, HR, 2018-11-06 20:30:00, 2018-11-06 20:31:12, 98
2, 33, SPO2, 2018-11-06 20:30:00, 2018-11-06 20:31:14, 99
3, 33, RR, 2018-11-06 20:30:00, 2018-11-06 20:31:10, 104
4, 33, RR, 2018-11-06 20:30:00, 2018-11-06 20:32:00, 22
5, 33, RR, 2018-07-23 21:28:00, 2018-07-23 21:32:10, 20
6, 33, BP, 2018-07-23 21:28:00, 2018-07-23 21:32:15, 102
7, 34, BP, 2018-07-25 20:32:00, 2018-07-25 21:42:07, 98
8, 34, GCS, 2018-07-25 20:32:00, 2018-07-25 21:42:10, 12
9, 34, HR, 2018-07-25 20:32:00, 2018-07-25 21:41:58, 99
10, 34, SpO2,2018-07-25 20:32:00, 2018-07-25 21:42:15, 89
11, 34, RR, 2018-07-25 20:32:00, 2018-07-25 21:42:12, 13
12, 34, GCS, 2018-07-25 22:54:00, 2018-07-25 22:52:00, 14
Return
Index, VisitID, Obs_DTM, BP_Value, BP_DTM, HR_Value, HR_DTM, SPO2_Value, SPO2_DTM, RR_Value, RR_DTM, GCS_Value, GCS_DTM
0, 33, 2018-11-06 20:30:00,120, 2018-11-06 20:31:08, 98, 2018-11-06 20:31:12, 99, 2018-11-06 20:31:14, 22, 2018-11-06 20:32:00, NaN, NaN
1, 33, 2018-07-23 21:28:00, 102, 2018-07-23 21:32:15, NaN, NaN, NaN, NaN, 2018-07-23 21:32:10, 20, NaN, NaN
2, 34, 2018-07-25 20:32:00, 98, 2018-07-25 21:42:07, 99, 2018-07-25 21:41:58, 89, 2018-07-25 21:42:15, 13, 2018-07-25 21:42:12, 2018-07-25 21:42:10, 12
3, 34, 2018-07-25 22:54:00, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 14, 2018-07-25 22:52:00
We can do:
#only if index is a columns
#df = df.set_index('Index')
new_df = df.pivot_table(index=['VisitID', 'Obs_DTM'],
columns='Obs',
aggfunc='first')
new_df = new_df.set_axis([f'{y}_{x}' for x, y in new_df.columns], axis=1).reset_index()