I have a dataframe like:
lst = [["High", "A"], ["High", "A"], ["High", "B"],["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns =["Class", "Grade"])
I need to get the mode (majority vote) of "Grade" in each "Class". If it's a tie vote, assign "x".
Below is what I expect to get:
Class | Grade | Majority_vote |
---|---|---|
High | A | A |
High | A | A |
High | B | A |
Medium | A | x |
Medium | B | x |
Medium | C | x |
This is my code:
df['majority_vote'] = df.groupby(['Class'])['Grade'].transform(lambda x: x.mode()[0])
I expected the code to return NaN when there is a tie, which I would then replace with "x".
However, what I get is below:
Class | Grade | Majority_vote |
---|---|---|
High | A | A |
High | A | A |
High | B | A |
Medium | A | A |
Medium | B | A |
Medium | C | A |
For class "Medium", the code returns the first element ("A") instead of NaN.
Any other method is appreciated. Could you please help me? Thank you in advance.
The issue with using x.mode()[0] is that mode() returns every value tied for most frequent, not NaN. pd.Series(['A', 'B', 'C']).mode() evaluates to ['A', 'B', 'C'], so taking [0] picks "A". Meanwhile, pd.Series(['A', 'A', 'B']).mode() evaluates to ['A'].
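A quick way to see this for yourself is the small check below. It only uses the two example series from above; the variable names `tied` and `clear` are just for illustration.

```python
import pandas as pd

# Three values, each appearing once: every value is a mode.
tied = pd.Series(["A", "B", "C"]).mode()

# "A" appears twice, so it is the only mode.
clear = pd.Series(["A", "A", "B"]).mode()

print(tied.tolist())   # ['A', 'B', 'C']
print(clear.tolist())  # ['A']

# Indexing with [0] on the tied result silently picks the first mode.
print(tied[0])         # A
```

So the length of the result is what tells you whether there was a tie.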
Here is a function that will return the mode (if there is only one) and "x" if there is a tie (i.e., multiple modes).
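import pandas as pd

lst = [["High", "A"], ["High", "A"], ["High", "B"], ["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])

def get_mode_or_x(series):
    # mode() returns all values tied for most frequent
    mode = series.mode()
    # exactly one mode means there is a clear majority
    if mode.size == 1:
        return mode.iloc[0]
    return "x"

# transform broadcasts each group's result back to every row of that group
df["majority_vote"] = df.groupby("Class")["Grade"].transform(get_mode_or_x)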
index | Class | Grade | majority_vote |
---|---|---|---|
0 | High | A | A |
1 | High | A | A |
2 | High | B | A |
3 | Medium | A | x |
4 | Medium | B | x |
5 | Medium | C | x |
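Since you asked for other methods: here is a rough alternative sketch that counts values with value_counts() instead of calling mode(). The function name `majority_or_tie` and the `tie_label` parameter are my own, not part of pandas, and returning the tie label for an empty group is an assumption about what you would want in that case.

```python
import pandas as pd

lst = [["High", "A"], ["High", "A"], ["High", "B"],
       ["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])

def majority_or_tie(series, tie_label="x"):
    # value_counts() sorts by count in descending order by default
    counts = series.value_counts()
    if counts.empty:
        # assumption: treat a group with no usable values like a tie
        return tie_label
    top_count = counts.iloc[0]
    # more than one value sharing the highest count means a tie
    if (counts == top_count).sum() > 1:
        return tie_label
    return counts.index[0]

df["majority_vote"] = df.groupby("Class")["Grade"].transform(majority_or_tie)
print(df["majority_vote"].tolist())  # ['A', 'A', 'A', 'x', 'x', 'x']
```

It gives the same result on your sample data. The only practical difference is that the tie label is a parameter, so you can swap "x" for something else without editing the function body.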