Search code examples
pythonpandasdataframepivot-tabledata-cleaning

Grouping by various columns and rows using pivot table


I have a long and complicated dataset where I am trying to simplify it such that there is only one row per unique ID pairs (found in col 1 (ID1) and col 2 (ID2)). I would like to do this by making one column per unique value in the 'favorite_grades' column (in my actual dataset it is more complicated than 'favorite_grades', so I would like something that can group by unique text). Also, I noticed that sometimes there are multiple 'favorite_grades' that are lumped up into one row and separated by a space.

After each unique text is its own column, I would like the names to be sorted into the row that matches their ID pair, as well as the column that matches the favorite grade of each student. The ones that have the same ID pair and same grade can be grouped together in a list (double quotes around them). I have made a sample data frame as well as the desired result (they are both pandas data frames).

original:

ID1    ID2        name    favorite_grades
 01      01        John     3rd 4th
 01      01        Kate     4th 5th
 01      02        Emily    4th
 01      03        Mark     5th
 01      03        Emma     5th 

desired result:

 ID1     ID2      3rd_grade         4th_grade      5th_grade  
 01      01         "John"          "John, Kate"   "Kate
 01      02                         "Emily"
 01      03                                      "Mark, Emma"

So far, I've tried:

df.pivot_table(index=['ID1','ID2'], columns='grade', values='name',aggfunc=', '.join).reset_index()

But it gives me this:

ID1  ID2     3rd 4th     4th 5th     4th               5th
    1    1      John        Kate                          NaN
         2       NaN        Emily       Emily             NaN
         3       NaN        NaN                           Mark, Emma

First off, the ID1 is blank for all other rows besides the first (I want it to be listed). Also, the '3rd 4th' and '4th 5th' should not be read as their own columns.


Solution

  • Try this:

    df.assign(favorite_grades=df['favorite_grades'].str.split(' '))\
      .explode('favorite_grades')\
      .groupby(['ID1', 'ID2', 'favorite_grades'])['name'].agg(', '.join)\
      .unstack(fill_value='')\
      .reset_index()
    

    Output:

    favorite_grades  ID1  ID2   3rd         4th         5th
    0                  1    1  John  John, Kate        Kate
    1                  1    2             Emily            
    2                  1    3                    Mark, Emma
    

    With double quotes...

    df.assign(favorite_grades=df['favorite_grades'].str.split(' '))\
      .explode('favorite_grades')\
      .groupby(['ID1', 'ID2', 'favorite_grades'])['name']\
      .agg(lambda x: f'''"{', '.join(x)}"''')\
      .unstack(fill_value='')\
      .reset_index())
    
    favorite_grades  ID1  ID2     3rd           4th           5th
    0                  1    1  "John"  "John, Kate"        "Kate"
    1                  1    2               "Emily"              
    2                  1    3                        "Mark, Emma"