Search code examples
pythonpandasdataframepivot

How to transform dataframe long to wide with "grouped" columns?


When pivoting the following dataframe from long to wide, I would like to get "groups" of columns and mark them with a prefix or suffix.

  • The groups of elements can have different sizes, i.e. consist of one, two or more grouped elements/rows, I used pairs of two here to keep the example simple.
import pandas as pd

df = pd.DataFrame(
    [
        {'group': 'group-009297534',  'single_id': 'single-011900051',  'country': 'ESP',  'name': '00000911'},
        {'group': 'group-009297534',  'single_id': 'single-000000821',  'country': 'USA',  'name': '00001054'},
        {'group': 'group-009280053',  'single_id': 'single-000000002',  'country': 'HUN',  'name': '00000496'},
        {'group': 'group-009280053',  'single_id': 'single-000000014',  'country': 'HUN',  'name': '00000795'},
        {'group': 'group-009245039',  'single_id': 'single-000001258',  'country': 'NOR',  'name': '00000527'},
        {'group': 'group-009245039',  'single_id': 'single-000000669',  'country': 'TWN',  'name': '00000535'}
    ]
)

My approach of assigning an index to the elements to be grouped and then using this for the column designation is already going in the right direction, but still deviates from the expected view

df['idx'] = df.groupby('group').cumcount()
df.pivot(index='group', columns='idx')
group ('single_id', 0) ('single_id', 1) ('country', 0) ('country', 1) ('name', 0) ('name', 1)
group-009245039 single-000001258 single-000000669 NOR TWN 00000527 00000535
group-009280053 single-000000002 single-000000014 HUN HUN 00000496 00000795
group-009297534 single-011900051 single-000000821 ESP USA 00000911 00001054

However, the expected solution would look like this:

group single_id_1 country_1 name_1 single_id_2 country_2 name_2
0 group-009245039 single-000001258 NOR 00000527 single-000000669 TWN 00000535
1 group-009280053 single-000000002 HUN 00000496 single-000000014 HUN 00000795
2 group-009297534 single-011900051 ESP 00000911 single-000000821 USA 00001054

I'm not sure whether the approach with the multi-index, which would then have to be sorted and merged somehow, is the right one or whether there is a more elegant option.


Solution

  • Code

    out = (df
           .assign(index=df.groupby('group').cumcount().add(1).astype('str'))
           .pivot(index='group', columns='index')
           .pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
           .sort_index(axis=1, 
                       key=lambda x: x.str.extract(r'.+_(\d+$)')[0].astype('int'))
           .reset_index()
    )
    

    out

    enter image description here