I've solved a part of this question Wide to long dataset using pandas but still need more help.
I've a dataset which has columns as:
IA01_Raw_Baseline
,IA01_Raw_Midline
,IA01_Class1_Endline
etc. I want to break it so that the mid word i.e. Raw, Class1 etc remain in the column and IAx, timeLine (mid, base, end) becomes two separate columns. For example for a row of data:
ID Country Type Region Gender IA01_Raw_EndLine IA01_Raw_Baseline IA01_Raw_Midline IA02_Raw QA_Include QA_Comments
SC1 France A Europe Male 4 8 1 yes N/A
It should become:
ID Country Type Region Gender Timeline IA Raw Class1 Class2 QA_Include QA_Comments
SC1 France A Europe Male Endline IA01 4 N/A N/A yes N/A
SC1 France A Europe Male Baseline IA01 8 N/A N/A yes N/A
SC1 France A Europe Male Midline IA01 1 N/A N/A yes N/A
This is just the conversion of one row where I've 500+ columns which has different IA from 01 to 12 and attributes like Raw, Class1, Class2, Amount etc all has baseline and midline etc.
When converting I did break them down to the columns, where I've idVars
containing columns that will be the index and valueVars
will have the IA01_Raw_Baseline
type columns:
idVars = list(gd_df.columns[0:40]) + list(gd_df.columns[472:527]) #values that will not pivot
valueVars = gd_df.columns[41:472]#.tolist() #value that will pivot
gd_df2 = gd_df.set_index(idVars)
gd_df2.columns = pd.MultiIndex.from_tuples(map(tuple, gd_df2.columns.str.split('_', n=1)))
gd_out = gd_df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
So this code gave me:
As you can see I got IA
column the way I wanted but the timeline is still embedded in the column name. What should I change in my code so that it should give this output:
UPDATE: By doing this:
s=df.set_index(idVars)
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
s.stack([0,2]).reset_index()
s.to_excel(r'gd_out1.xlsx')
Since you mentioned wide to long , we using wide_to_long
s=pd.wide_to_long(df,['IA01_Raw'],i=['ID', 'Country', 'Type', 'Region', 'Gender','IA02_Raw', 'QA_Include',
'QA_Comments'],j='Timeline',suffix='\w+',sep='_')
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw'])
s.stack(0).reset_index()
Out[27]:
raw ID Country Type Region ... QA_Comments Timeline IA Raw
0 SC1 France A Europe ... NaN EndLine IA01 4
1 SC1 France A Europe ... NaN Baseline IA01 8
2 SC1 France A Europe ... NaN Midline IA01 1
[3 rows x 11 columns]
Update
s=df.set_index(['ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include',
'QA_Comments','IA02_Raw'])
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
s.stack([0,2]).reset_index()