I am new to Python and Pandas. I want to find the most common item for each row for columns Opt_1 to Opt_7. Note for some reason some of the blank cells appear to have NaN and in other cases it has None.
ID Col_1 Col_2 Opt_1 Opt_2 Opt_3 Opt_4 Opt_5 Opt_6 Opt_7
1 Game 1 Team 1 13
2 Game 1 Team 2 -13
3 Game 1 Team 1
4 Game 1 Team 2
5 Game 2 Team 1 -7.5 -7.5 -7.5 -7.5
6 Game 2 Team 2 7.5 7.5 7.5 7.5
7 Game 2 Team 1 -2.5 -1.5
8 Game 2 Team 2 2.5 1.5
9 Game 3 Team 1 3.5 3.5
10 Game 3 Team 2 -3.5 -3.5
11 Game 3 Team 1 -1 -1.5 -1
12 Game 3 Team 2 1 1.5 1
I already tried the following code, which worked as expected for most rows, but not all. And it was a bit slow.
def freq_value(series):
return Counter(series).most_common()[0][0]
for row in df.iterrows():
df['result'] = df.apply(lambda row: freq_value((row['Opt_1'], row['Opt_2'], row['Opt_3'], row['Opt_4'], row['Opt_5'], row['Opt_6'], row['Opt_7'])), axis=1)
Below is the expected results and the actual results:
ID Expected Actual Result
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 -7.5 -7.5
6 7.5 7.5
7 NaN NaN
8 NaN NaN
9 3.5 3.5
10 -3.5 -3.5
11 -1 NaN
12 1 NaN
Is there anyway of doing this so it is 100% correct, and perhaps done without iterating through each row one at a time? Thanks in advance for any suggestions.
Use filter
to select columns and mode
+ mask
to find unique modes only:
(df.filter(like='Opt')
.mode(axis=1)
.set_axis(['a', 'b'], axis=1, inplace=False)
.eval('a.mask(b.notna())', engine='python'))
0 13.0
1 -13.0
2 NaN
3 NaN
4 -7.5
5 7.5
6 NaN
7 NaN
8 3.5
9 -3.5
10 -1.0
11 1.0
Name: a, dtype: float64
mode
will return all modes for a given row. This means that if there are two values that are equally frequent, there will be two columns in the output. The solution above handles the case when there are at most two columns.
If the mode is unique, the solution can be simplified to
df.filter(like='Opt').mode(axis=1).iloc[:, 0]
Another solution when there is no unique mode, this will generalise to any number of columns in the output.
u = df.filter(like='Opt').mode(axis=1)
if len(u.columns) > 1:
u = u.iloc[:, 0].where(u.iloc[:, 1:].isna().all(axis=1))
u
0 13.0
1 -13.0
2 NaN
3 NaN
4 -7.5
5 7.5
6 NaN
7 NaN
8 3.5
9 -3.5
10 -1.0
11 1.0
Name: 0, dtype: float64