Tags: python, pandas, function, multiple-columns, frequency

Getting the item with max frequency from multiple columns in a dataframe


I have a dataframe like this:

a1  a2  a3  a4
4   4   4   4
4   4   4   4
2   3   2   3
2   3   3   3
2   2   2   2
2   2   2   2

Desired output:

a1  a2  a3  a4  max_freq
4   4   4   4   4
4   4   4   4   4
2   3   2   3   3
2   3   3   3   3
2   2   2   2   2
2   2   2   2   2

For each row, I want to return the element that occurs most often across columns a1, a2, a3, a4. E.g. in the first row, 4 occurs 4 times, so max_freq=4, and so on. In case of ties, return the value in a4.

I started off with something like:

import numpy as np

def get_max_freq(row):
    # Distinct values in the row and how often each one occurs
    unique, counts = np.unique(np.array(row), return_counts=True)
    print(unique, counts)

df_temp.apply(get_max_freq, axis=1)

This gives me the frequency of each item row-wise. I could go on and convert the counts into a dataframe, sort by count, select the first element and return it from the function, but that seems slow. Is there a more Pythonic way to solve this, one that stays fast on a dataframe of ~1 million rows?
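For reference, the row-wise idea described above can be finished without building an intermediate dataframe. This is only a sketch of that baseline, not a fast solution: it still calls a Python function once per row. The sample frame is rebuilt from the table in the question, and the fallback to the smallest winning value when a4 is not among the most frequent values is an assumption (with four columns, a4 is always part of any tie).

```python
import numpy as np
import pandas as pd

def get_max_freq(row):
    # Distinct values in the row and how often each one occurs.
    unique, counts = np.unique(row.to_numpy(), return_counts=True)
    # All values that reach the highest count (more than one on a tie).
    winners = unique[counts == counts.max()]
    # Tie rule from the question: prefer a4 when it is among the winners.
    # Assumption: otherwise fall back to the smallest winning value.
    return row["a4"] if row["a4"] in winners else winners[0]

# Sample data rebuilt from the table in the question.
df_temp = pd.DataFrame({
    "a1": [4, 4, 2, 2, 2, 2],
    "a2": [4, 4, 3, 3, 2, 2],
    "a3": [4, 4, 2, 3, 2, 2],
    "a4": [4, 4, 3, 3, 2, 2],
})

cols = ["a1", "a2", "a3", "a4"]
df_temp["max_freq"] = df_temp[cols].apply(get_max_freq, axis=1)
print(df_temp)
```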


Solution

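  • If you're concerned about speed, and don't care about the tie-breaking constraint on a4 that you mentioned in the comments, you can use scipy.stats.mode. Note that on ties it returns the smallest of the tied values, so row 2 below gives 2 rather than the value in a4: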

    import scipy.stats

    df['freq'] = scipy.stats.mode(df.values, axis=1)[0]
    
       a1  a2  a3  a4  freq
    0   4   4   4   4     4
    1   4   4   4   4     4
    2   2   3   2   3     2
    3   2   3   3   3     3
    4   2   2   2   2     2
    5   2   2   2   2     2
    

    Timings

    df = pd.concat([df]*10000)
    
    In [244]: %timeit df.mode(1)
    12.7 s ± 268 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [245]: %timeit scipy.stats.mode(df.values, 1)[0]
    10.8 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    On this 60,000-row frame, scipy.stats.mode is roughly 1,000 times faster than df.mode(1) (10.8 ms versus 12.7 s).
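If the tie rule on a4 does matter, one possible vectorized alternative is to count occurrences with NumPy broadcasting and then pick the rightmost column that holds a most-frequent value. This is a sketch under assumptions, not the method from the answer above, and it has not been timed here. The sample frame is rebuilt from the question, the column list `cols` is assumed to end with a4, and the intermediate boolean array has shape (rows, columns, columns), which stays small for four columns.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "a1": [4, 4, 2, 2, 2, 2],
    "a2": [4, 4, 3, 3, 2, 2],
    "a3": [4, 4, 2, 3, 2, 2],
    "a4": [4, 4, 3, 3, 2, 2],
})

cols = ["a1", "a2", "a3", "a4"]  # a4 must be last for the tie rule below
vals = df[cols].to_numpy()

# counts[i, j] = number of entries in row i that equal vals[i, j].
counts = (vals[:, :, None] == vals[:, None, :]).sum(axis=2)

# argmax returns the first maximum, so scan the columns right to left:
# the rightmost column holding a most-frequent value wins, which is a4
# whenever a4 takes part in a tie.
pos_from_right = counts[:, ::-1].argmax(axis=1)
best_col = vals.shape[1] - 1 - pos_from_right

df["max_freq"] = vals[np.arange(len(vals)), best_col]
print(df)
```

Unlike the scipy.stats.mode result, row 2 (2, 3, 2, 3) resolves to the value in a4 here, matching the desired output in the question.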