Search code examples
pythonpandasdata-cleaningpreprocessordataframe

How do I pivot a pandas DataFrame and then add hierarchical columns?


Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?

For this question the underlying schema has the following rules:

  • Each ProjectID appears once
  • Each ProjectID is associated to a single PM
  • Each ProjectID is associated to a single Category
  • Multiple ProjectIDs can be associated with a single Category
  • Multiple ProjectIDs can be associated with a single PM

Input Data Set A

df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
          'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
          'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
          'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
          'Score':[10,7,10,5,15,10,0,2]})

enter image description here

Desired Output enter image description here Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.

I have tried so far:

  • df_A.set_index(['Category','ProjectID'],append=True).unstack() - This would only work if I first create a nested index of ['Category','ProjectID] and ADD that to the original numerical index created with a standard dataframe, however it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
  • df_A.groupby() - I wasn't able to use this because it appears to force aggregation of some sort in order to get all of the values of a single category on a single row.
  • df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation and it starts to look similar to my intended output, but can only see the 'Comments' field and also cannot set nested columns this way. I receive an error when trying to set values=['Comments','Score'] in the pivot statement.

I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.

I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.


Solution

  • I think this is what you are looking for:

    pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)
    
    Out[4]:
    PM                        Amy                    Bob        ...              Jill
    ProjectID                   6                      1        ...                 5                      7
                         Comments Score         Comments Score  ...          Comments Score         Comments Score
      Category                                                  ...
    0 Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
      Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
      Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
      Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN
    
    [4 rows x 16 columns]
    

    EDIT: To select rows by category you should get rid of the row index 0 by adding .xs():

    In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)
    
    In [4]: df_A_transformed
    Out[4]:
    PM                      Amy                    Bob        ...              Jill
    ProjectID                 6                      1        ...                 5                      7
                       Comments Score         Comments Score  ...          Comments Score         Comments Score
    Category                                                  ...
    Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
    Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
    Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
    Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN
    
    [4 rows x 16 columns]
    
    In [5]: df_A_transformed.loc['Category B']
    Out[5]:
    PM    ProjectID
    Amy   6          Comments                NaN
                     Score                   NaN
    Bob   1          Comments                NaN
                     Score                   NaN
    Jack  3          Comments                NaN
                     Score                   NaN
          4          Comments    Justification 4
                     Score                     5
          8          Comments    Justification 8
                     Score                     2
    Jill  2          Comments    Justification 2
                     Score                     7
          5          Comments                NaN
                     Score                   NaN
          7          Comments    Justification 7
                     Score                     0
    Name: Category B, dtype: object