Search code examples
pythonpandaspandas-groupbymulti-index

Pandas groupby using MultiIndex values


Suppose I have a dataframe with multiindex columns representing student majors and student GPAs. I want to find the class rank per major for each row. If the students had only one major each, then I could stack the major as another level of the multiindex and groupby(level=1).rank(). However, because the students can change their major, I have to have those as a separate feature for each year. What would be an efficient way to get the correct output? I know I could brute force it by iterating over the rows, but I'm hoping to avoid that as I may be dealing with a large amount of data.

Sample df:

categories = ['major', 'gpa']
students = ['Joe', 'Bob', 'Sara']
columns = pd.MultiIndex.from_product([categories, students])
index = range(4)
students_df = pd.DataFrame([['english', 'math', 'math', 3.8, 2.2, 3.7],
                            ['english', 'math', 'math', 3.5, 2.4, 3.9],
                            ['english', 'english', 'math', 3.5, 3.6, 3.9],
                            ['english', 'english', 'math', 3.7, 3.5, 3.8],
                            ], index=range(4), columns=columns)
print(students_df)

         major                 gpa          
       Joe      Bob  Sara  Joe  Bob Sara
0  english     math  math  3.8  2.2  3.7
1  english     math  math  3.5  2.4  3.9
2  english  english  math  3.5  3.6  3.9
3  english  english  math  3.7  3.5  3.8

Expected Output

       rank          
       Joe  Bob Sara
0      1      2    1
1      1      2    1
2      2      1    1
3      1      2    1

Solution

  • You can use stack to setup the dataframe, so that you can effectively groupby the index ([0,1,2,3]) and major to calculate the rank. Then, unstack(1) the dataframe back into it's original column. The integer parameter passed to stack() indicates the position of the multi-index column level, that you send from columns to rows (or wide-to-long). And, unstack() is the exact opposite -- it sends from rows to columns (or long-to-wide). So, 0 is the first level, 1 is the second level, etc. Regardless of position, stack() and unstack() are very effective at unpivoting / pivoting dataframes.

    Here is method 1 utilizing .index.get_level_values(0):

    df = students_df.stack(1)
    df = (df.assign(gpa=df.groupby([df.index.get_level_values(0), 'major'])
                          .rank(ascending=False)).unstack(1))
            
    Out[1]: 
       gpa              major               
       Bob  Joe Sara      Bob      Joe  Sara
    0  2.0  1.0  1.0     math  english  math
    1  2.0  1.0  1.0     math  english  math
    2  1.0  2.0  1.0  english  english  math
    3  2.0  1.0  1.0  english  english  math
    

    Here is method 2 utilizing rename_axis() but has a slightly different output:

    df = students_df.stack(1).rename_axis(['', 'Name'])
    df = df.assign(gpa=df.groupby(['', 'major']).rank(ascending=False)).unstack(1)
    df
    
    Out[2]: 
          gpa              major               
    Name  Bob  Joe Sara      Bob      Joe  Sara
    
    0     2.0  1.0  1.0     math  english  math
    1     2.0  1.0  1.0     math  english  math
    2     1.0  2.0  1.0  english  english  math
    3     2.0  1.0  1.0  english  english  math