Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?
For this question the underlying schema has the following rules:
Input Data Set A
df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
'Score':[10,7,10,5,15,10,0,2]})
Desired Output Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.
I have tried so far:
I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.
I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.
I think this is what you are looking for:
pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
0 Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
EDIT: To select rows by category you should get rid of the row index 0 by adding .xs():
In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)
In [4]: df_A_transformed
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM ProjectID
Amy 6 Comments NaN
Score NaN
Bob 1 Comments NaN
Score NaN
Jack 3 Comments NaN
Score NaN
4 Comments Justification 4
Score 5
8 Comments Justification 8
Score 2
Jill 2 Comments Justification 2
Score 7
5 Comments NaN
Score NaN
7 Comments Justification 7
Score 0
Name: Category B, dtype: object