Suppose I have a dataframe with multiindex columns representing student majors and student GPAs. I want to find the class rank per major for each row. If the students had only one major each, then I could stack the major as another level of the multiindex and groupby(level=1).rank(). However, because the students can change their major, I have to have those as a separate feature for each year. What would be an efficient way to get the correct output? I know I could brute force it by iterating over the rows, but I'm hoping to avoid that as I may be dealing with a large amount of data.
Sample df:
categories = ['major', 'gpa']
students = ['Joe', 'Bob', 'Sara']
columns = pd.MultiIndex.from_product([categories, students])
index = range(4)
students_df = pd.DataFrame([['english', 'math', 'math', 3.8, 2.2, 3.7],
['english', 'math', 'math', 3.5, 2.4, 3.9],
['english', 'english', 'math', 3.5, 3.6, 3.9],
['english', 'english', 'math', 3.7, 3.5, 3.8],
], index=range(4), columns=columns)
print(students_df)
major gpa
Joe Bob Sara Joe Bob Sara
0 english math math 3.8 2.2 3.7
1 english math math 3.5 2.4 3.9
2 english english math 3.5 3.6 3.9
3 english english math 3.7 3.5 3.8
Expected Output
rank
Joe Bob Sara
0 1 2 1
1 1 2 1
2 2 1 1
3 1 2 1
You can use stack
to setup the dataframe, so that you can effectively groupby
the index ([0,1,2,3]) and major
to calculate the rank
. Then, unstack(1)
the dataframe back into it's original column. The integer parameter passed to stack()
indicates the position of the multi-index column level, that you send from columns to rows (or wide-to-long). And, unstack()
is the exact opposite -- it sends from rows to columns (or long-to-wide). So, 0
is the first level, 1
is the second level, etc. Regardless of position, stack()
and unstack()
are very effective at unpivoting / pivoting dataframes.
Here is method 1 utilizing .index.get_level_values(0)
:
df = students_df.stack(1)
df = (df.assign(gpa=df.groupby([df.index.get_level_values(0), 'major'])
.rank(ascending=False)).unstack(1))
Out[1]:
gpa major
Bob Joe Sara Bob Joe Sara
0 2.0 1.0 1.0 math english math
1 2.0 1.0 1.0 math english math
2 1.0 2.0 1.0 english english math
3 2.0 1.0 1.0 english english math
Here is method 2 utilizing rename_axis()
but has a slightly different output:
df = students_df.stack(1).rename_axis(['', 'Name'])
df = df.assign(gpa=df.groupby(['', 'major']).rank(ascending=False)).unstack(1)
df
Out[2]:
gpa major
Name Bob Joe Sara Bob Joe Sara
0 2.0 1.0 1.0 math english math
1 2.0 1.0 1.0 math english math
2 1.0 2.0 1.0 english english math
3 2.0 1.0 1.0 english english math