Search code examples
pythonpandasdataframemulti-indexcolumnsorting

Sorting MultiIndex DataFrame format from columns to variables in Pandas


I have this DataFrame :

      Age         Hgt         Wgt     
      x     y     x     y     x     y
0     26    24    160   164   95    71
1     35    37    182   163   110   68
2     57    52    175   167   89    65

It is a MultiIndex DataFrame.

I'm using pandas to get this final result:

      x_new    y_new    parameter     
0     26       24       Age
1     35       37       Age  
2     57       52       Age
3     160      164      Hgt           
4     182      163      Hgt             
5     175      167      Hgt              
6     95       71       Wgt     
7     110      68       Wgt     
8     89       65       Wgt     

Basically, all the x columns are merged/stacked under one new column x_new, as well as y columns under y_new column. Always the x value should take the y value of the same raw and column.

This is what I tried to do:

First, I used melt() after I joined the column indices and became single index '_'.join(col).strip() It created extra wrong rows. These wrong rows have wrong values, for example: Age_x and Hgt_y in the same row. Remember, always, for example: Age_x and Age_y in the same row. Or, Hgt_x and Hgt_y are in the same row.

Second, I used stack(), and it gave me this result:

df.stack().reset_index(level=0, drop=True).reset_index()
      index    Age      Hgt      Wgt
0     x        26       160      95
1     y        24       164      71  
2     x        35       182      110
3     y        37       163      68
4     x        57       175      89
5     y        52       167      65    

I don't know what else I can do.

Is there a way to turn the MultiIndex DataFrame to the final result that I'm looking for using simple pandascode?


Solution

  • Just specify to stack level=0 instead of the default (level=-1) (droplevel is used to remove the unneeded index level instead of reset_index twice):

    df.stack(level=0).droplevel(0).reset_index()
    
      index    x    y
    0   Age   26   24
    1   Hgt  160  164
    2   Wgt   95   71
    3   Age   35   37
    4   Hgt  182  163
    5   Wgt  110   68
    6   Age   57   52
    7   Hgt  175  167
    8   Wgt   89   65
    

    Add sort_index to order lexicographically before reset_index:

    df.stack(level=0).droplevel(0).sort_index().reset_index()
    
      index    x    y
    0   Age   26   24
    1   Age   35   37
    2   Age   57   52
    3   Hgt  160  164
    4   Hgt  182  163
    5   Hgt  175  167
    6   Wgt   95   71
    7   Wgt  110   68
    8   Wgt   89   65
    

    We can further clean the output by rename_axis and add_suffix to add change the index name and add the '_new' suffix to the 'x' and 'y' columns:

    (df.stack(level=0) 
       .droplevel(0)
       .sort_index() 
       .add_suffix('_new')
       .rename_axis(index='parameter') 
       .reset_index()
    )
    
      parameter  x_new  y_new
    0       Age     26     24
    1       Age     35     37
    2       Age     57     52
    3       Hgt    160    164
    4       Hgt    182    163
    5       Hgt    175    167
    6       Wgt     95     71
    7       Wgt    110     68
    8       Wgt     89     65
    

    Another round about way, using stack/unstack, since it implicitly sorts the levels:

    (df
    .T
    .unstack(1)
    .stack(0)
    .droplevel(1)
    .rename_axis('parameter')
    .add_suffix('_new')
    .reset_index()
    )
    

    Setup used:

    import pandas as pd
    
    df = pd.DataFrame({
        ('Age', 'x'): [26, 35, 57], ('Age', 'y'): [24, 37, 52],
        ('Hgt', 'x'): [160, 182, 175], ('Hgt', 'y'): [164, 163, 167],
        ('Wgt', 'x'): [95, 110, 89], ('Wgt', 'y'): [71, 68, 65]
    })