I have a Pandas Dataframe that contains multiple findings for an individual (medical history) and I want to have an individuals history all collapsed in one row but with the order retained but at the appointment date level, say all findings/examination results form their past formatted in wide format.
I don't know how best to do this as all groupby
methods require me to provide an agg
that then just combines all columns into one column via join as opposed to new separate columns for a given appointment in the past.
Some columns will not be pivoted
or used as groupby
index (patientId, apptDate, age, bmi
)
A consideration is how to best deal with ordering the created medical history mh_
columns so that records are populated in the lower generated columns first mh_result1
etc.
Original DF
| patientId | apptDate | age | bmi | examinationId | result | category | comment |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|
| 1 | 2024-07-08 | 45 | 22 | 45 | Long Term | Cardiovascular | Cardiovascular defect finding, fup required |
| 1 | 2024-02-01 | 45 | 22 | 33 | None | None | None |
| 1 | 2023-11-14 | 45 | 22 | 12 | Short Term | Respiratory | Shortness of breath, med prescribed |
| 2 | 2023-12-23 | 32 | 12 | 18 | Short Term | Gastro | Recorded malnutrition |
| 2 | 2022-12-11 | 32 | 13 | 21 | Short Term | Gastro | None |
Desired DF
| patientId | apptDate | age | bmi | examinationId | result | category | comment | mh_result1 | mh_category1 | mh_comment1 | mh_result2 | mh_category2 | mh_category2 |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|------------|--------------|-------------------------------------|------------|--------------|--------------|
| 1 | 2024-07-08 | 45 | 22 | 45 | Long Term | Cardiovascular | Cardiovascular defect finding, fup required | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None |
| 1 | 2024-02-01 | 45 | 22 | 33 | None | None | None | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None |
| 1 | 2023-11-14 | 45 | 22 | 12 | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None | None | None | None |
| 2 | 2023-12-23 | 32 | 12 | 18 | Short Term | Gastro | Recorded malnutrition | Short Term | Gastro | None | None | None | None |
| 2 | 2022-12-11 | 32 | 13 | 21 | Short Term | Gastro | None | None | None | None | None | None | None |
You could pivot
, groupby.transform
with ffill
+shift
, then merge
:
tmp = (df
.sort_values(by='apptDate')
.assign(col=lambda x: x.groupby('patientId').cumcount().add(1))
.pivot(index=['patientId', 'apptDate'], columns='col', values=['result', 'category', 'comment'])
.sort_index(level=1, axis=1, sort_remaining=False)
.groupby(level='patientId').transform(lambda x: x.ffill().shift())
)
tmp.columns = tmp.columns.map(lambda x: f'mh_{x[0]}{x[1]}')
out = df.merge(tmp, left_on=['patientId', 'apptDate'], right_index=True, how='left')
Output:
patientId apptDate age bmi examinationId result category comment mh_result1 mh_category1 mh_comment1 mh_result2 mh_category2 mh_comment2 mh_result3 mh_category3 mh_comment3
0 1 2024-07-08 45 22 45 Long Term Cardiovascular Cardiovascular defect finding, fup required Short Term Respiratory Shortness of breath, med prescribed NaN NaN NaN None None None
1 1 2024-02-01 45 22 33 NaN NaN NaN Short Term Respiratory Shortness of breath, med prescribed NaN NaN NaN None None None
2 1 2023-11-14 45 22 12 Short Term Respiratory Shortness of breath, med prescribed None None None NaN NaN NaN None None None
3 2 2023-12-23 32 12 18 Short Term Gastro Recorded malnutrition Short Term Gastro NaN None None None NaN NaN NaN
4 2 2022-12-11 32 13 21 Short Term Gastro NaN None None NaN None None None NaN NaN NaN