Search code examples
pythonpandasmulti-index

Wide to Long MiltiIndex dataset using pandas


I've solved a part of this question Wide to long dataset using pandas but still need more help.

I've a dataset which has columns as: IA01_Raw_Baseline,IA01_Raw_Midline,IA01_Class1_Endline etc. I want to break it so that the mid word i.e. Raw, Class1 etc remain in the column and IAx, timeLine (mid, base, end) becomes two separate columns. For example for a row of data:

ID   Country Type Region Gender IA01_Raw_EndLine  IA01_Raw_Baseline  IA01_Raw_Midline IA02_Raw QA_Include QA_Comments

SC1  France  A    Europe Male   4                 8                  1                         yes       N/A

It should become:

ID   Country Type Region Gender Timeline IA   Raw Class1 Class2 QA_Include QA_Comments

SC1  France  A    Europe Male   Endline  IA01 4    N/A      N/A yes        N/A
SC1  France  A    Europe Male   Baseline IA01 8    N/A      N/A yes        N/A    
SC1  France  A    Europe Male   Midline  IA01 1    N/A      N/A yes        N/A

This is just the conversion of one row where I've 500+ columns which has different IA from 01 to 12 and attributes like Raw, Class1, Class2, Amount etc all has baseline and midline etc.

When converting I did break them down to the columns, where I've idVars containing columns that will be the index and valueVars will have the IA01_Raw_Baseline type columns:

idVars = list(gd_df.columns[0:40]) + list(gd_df.columns[472:527]) #values that will not pivot
valueVars = gd_df.columns[41:472]#.tolist() #value that will pivot

gd_df2 = gd_df.set_index(idVars)
gd_df2.columns = pd.MultiIndex.from_tuples(map(tuple, gd_df2.columns.str.split('_', n=1)))
gd_out =  gd_df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)

So this code gave me:

enter image description here

As you can see I got IA column the way I wanted but the timeline is still embedded in the column name. What should I change in my code so that it should give this output:

enter image description here

UPDATE: By doing this:

s=df.set_index(idVars)
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
s.stack([0,2]).reset_index()
s.to_excel(r'gd_out1.xlsx')

I'm getting: enter image description here


Solution

  • Since you mentioned wide to long , we using wide_to_long

    s=pd.wide_to_long(df,['IA01_Raw'],i=['ID', 'Country', 'Type', 'Region', 'Gender','IA02_Raw', 'QA_Include',
           'QA_Comments'],j='Timeline',suffix='\w+',sep='_')
    s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw'])
    s.stack(0).reset_index()
    Out[27]: 
    raw   ID Country Type  Region ...  QA_Comments  Timeline    IA  Raw
    0    SC1  France    A  Europe ...          NaN   EndLine  IA01    4
    1    SC1  France    A  Europe ...          NaN  Baseline  IA01    8
    2    SC1  France    A  Europe ...          NaN   Midline  IA01    1
    [3 rows x 11 columns]
    

    Update

    s=df.set_index(['ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include',
           'QA_Comments','IA02_Raw'])
    s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
    s.stack([0,2]).reset_index()