Search code examples
python-3.xpandasdataframegroup-by

Collapse a Pandas Dataframe so as to retain all columns but specify which order the columns should be stored depending on group by/pivoted columns


I have a Pandas Dataframe that contains multiple findings for an individual (medical history) and I want to have an individuals history all collapsed in one row but with the order retained but at the appointment date level, say all findings/examination results form their past formatted in wide format.

I don't know how best to do this as all groupby methods require me to provide an agg that then just combines all columns into one column via join as opposed to new separate columns for a given appointment in the past.

Some columns will not be pivoted or used as groupby index (patientId, apptDate, age, bmi)

A consideration is how to best deal with ordering the created medical history mh_ columns so that records are populated in the lower generated columns first mh_result1 etc.

Original DF

| patientId | apptDate   | age | bmi | examinationId | result     | category       | comment                                     |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|
| 1         | 2024-07-08 | 45  | 22  | 45            | Long Term  | Cardiovascular | Cardiovascular defect finding, fup required |
| 1         | 2024-02-01 | 45  | 22  | 33            | None       | None           | None                                        |
| 1         | 2023-11-14 | 45  | 22  | 12            | Short Term | Respiratory    | Shortness of breath, med prescribed         |
| 2         | 2023-12-23 | 32  | 12  | 18            | Short Term | Gastro         | Recorded malnutrition                       |
| 2         | 2022-12-11 | 32  | 13  | 21            | Short Term | Gastro         | None                                        |

Desired DF

| patientId | apptDate   | age | bmi | examinationId | result     | category       | comment                                     | mh_result1 | mh_category1 | mh_comment1                         | mh_result2 | mh_category2 | mh_category2 |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|------------|--------------|-------------------------------------|------------|--------------|--------------|
| 1         | 2024-07-08 | 45  | 22  | 45            | Long Term  | Cardiovascular | Cardiovascular defect finding, fup required | Short Term | Respiratory  | Shortness of breath, med prescribed | None       | None         | None         |
| 1         | 2024-02-01 | 45  | 22  | 33            | None       | None           | None                                        | Short Term | Respiratory  | Shortness of breath, med prescribed | None       | None         | None         |
| 1         | 2023-11-14 | 45  | 22  | 12            | Short Term | Respiratory    | Shortness of breath, med prescribed         | None       | None         | None                                | None       | None         | None         |
| 2         | 2023-12-23 | 32  | 12  | 18            | Short Term | Gastro         | Recorded malnutrition                       | Short Term | Gastro       | None                                | None       | None         | None         |
| 2         | 2022-12-11 | 32  | 13  | 21            | Short Term | Gastro         | None                                        | None       | None         | None                                | None       | None         | None         |

Solution

  • You could pivot, groupby.transform with ffill+shift, then merge:

    tmp = (df
       .sort_values(by='apptDate')
       .assign(col=lambda x: x.groupby('patientId').cumcount().add(1))
       .pivot(index=['patientId', 'apptDate'], columns='col', values=['result', 'category', 'comment'])
       .sort_index(level=1, axis=1, sort_remaining=False)
       .groupby(level='patientId').transform(lambda x: x.ffill().shift())
    )
    
    tmp.columns = tmp.columns.map(lambda x: f'mh_{x[0]}{x[1]}')
    
    out = df.merge(tmp, left_on=['patientId', 'apptDate'], right_index=True, how='left')
    

    Output:

       patientId    apptDate  age  bmi  examinationId      result        category                                      comment  mh_result1 mh_category1                          mh_comment1 mh_result2 mh_category2 mh_comment2 mh_result3 mh_category3 mh_comment3
    0          1  2024-07-08   45   22             45   Long Term  Cardiovascular  Cardiovascular defect finding, fup required  Short Term  Respiratory  Shortness of breath, med prescribed        NaN          NaN         NaN       None         None        None
    1          1  2024-02-01   45   22             33         NaN             NaN                                          NaN  Short Term  Respiratory  Shortness of breath, med prescribed        NaN          NaN         NaN       None         None        None
    2          1  2023-11-14   45   22             12  Short Term     Respiratory          Shortness of breath, med prescribed        None         None                                 None        NaN          NaN         NaN       None         None        None
    3          2  2023-12-23   32   12             18  Short Term          Gastro                        Recorded malnutrition  Short Term       Gastro                                  NaN       None         None        None        NaN          NaN         NaN
    4          2  2022-12-11   32   13             21  Short Term          Gastro                                          NaN        None         None                                  NaN       None         None        None        NaN          NaN         NaN