Search code examples
pythonpandasdataframemode

Find the mode across multiple columns for each row of a pandas DataFrame


I am new to Python and Pandas. I want to find the most common item for each row for columns Opt_1 to Opt_7. Note for some reason some of the blank cells appear to have NaN and in other cases it has None.

ID  Col_1   Col_2   Opt_1   Opt_2   Opt_3   Opt_4   Opt_5   Opt_6   Opt_7 
1   Game 1  Team 1  13                       
2   Game 1  Team 2  -13                      
3   Game 1  Team 1                           
4   Game 1  Team 2                           
5   Game 2  Team 1  -7.5    -7.5    -7.5    -7.5             
6   Game 2  Team 2  7.5     7.5     7.5     7.5          
7   Game 2  Team 1          -2.5    -1.5             
8   Game 2  Team 2          2.5     1.5          
9   Game 3  Team 1          3.5     3.5          
10  Game 3  Team 2          -3.5    -3.5             
11  Game 3  Team 1  -1      -1.5    -1       
12  Game 3  Team 2  1       1.5     1

I already tried the following code, which worked as expected for most rows, but not all. And it was a bit slow.

def freq_value(series):
    return Counter(series).most_common()[0][0]

for row in df.iterrows():
     df['result'] = df.apply(lambda row: freq_value((row['Opt_1'], row['Opt_2'], row['Opt_3'], row['Opt_4'], row['Opt_5'], row['Opt_6'], row['Opt_7'])), axis=1)

Below is the expected results and the actual results:

ID  Expected    Actual Result
1   NaN         NaN
2   NaN         NaN
3   NaN         NaN
4   NaN         NaN
5   -7.5            -7.5
6   7.5         7.5
7   NaN         NaN
8   NaN         NaN
9   3.5         3.5
10  -3.5           -3.5
11  -1          NaN
12  1           NaN

Is there anyway of doing this so it is 100% correct, and perhaps done without iterating through each row one at a time? Thanks in advance for any suggestions.


Solution

  • Use filter to select columns and mode + mask to find unique modes only:

    (df.filter(like='Opt')
       .mode(axis=1)
       .set_axis(['a', 'b'], axis=1, inplace=False)
       .eval('a.mask(b.notna())', engine='python'))
    
    0     13.0
    1    -13.0
    2      NaN
    3      NaN
    4     -7.5
    5      7.5
    6      NaN
    7      NaN
    8      3.5
    9     -3.5
    10    -1.0
    11     1.0
    Name: a, dtype: float64
    

    mode will return all modes for a given row. This means that if there are two values that are equally frequent, there will be two columns in the output. The solution above handles the case when there are at most two columns.

    If the mode is unique, the solution can be simplified to

    df.filter(like='Opt').mode(axis=1).iloc[:, 0]
    

    Another solution when there is no unique mode, this will generalise to any number of columns in the output.

    u = df.filter(like='Opt').mode(axis=1)
    if len(u.columns) > 1:
        u = u.iloc[:, 0].where(u.iloc[:, 1:].isna().all(axis=1))
    
    u
    0     13.0
    1    -13.0
    2      NaN
    3      NaN
    4     -7.5
    5      7.5
    6      NaN
    7      NaN
    8      3.5
    9     -3.5
    10    -1.0
    11     1.0
    Name: 0, dtype: float64