python pandas function multiple-columns frequency

getting item with max frequency from multiple columns in a dataframe

I have a dataframe like this:

a1  a2  a3  a4
4   4   4   4
4   4   4   4
2   3   2   3
2   3   3   3
2   2   2   2
2   2   2   2

Desired output:

a1  a2  a3  a4  max_freq
4   4   4   4   4
4   4   4   4   4
2   3   2   3   3
2   3   3   3   3
2   2   2   2   2
2   2   2   2   2

For each row, I want to return the element that occurs most often across columns a1, a2, a3, a4. For example, in the first row 4 occurs 4 times, so max_freq = 4, and so on. In case of ties, return the value in a4.

I started off with something like:

import numpy as np

def get_max_freq(row):
    # Distinct values in the row and how often each one occurs
    unique, counts = np.unique(np.array(row), return_counts=True)
    print(unique, counts)

df_temp.apply(get_max_freq, axis=1)

I am able to get the frequency of items row-wise. I could go on to convert them into a dataframe, sort by count, select the first element and return it from the function, but that seems slow. Is there a Pythonic way to solve this that is fast on a dataframe of ~1M rows?

Solution

If you're concerned about speed and don't care about the a4 tie-breaking constraint, as you mentioned in the comments, you can use scipy.stats.mode. On ties it returns the smallest value, which is why row 2 below gives 2 rather than 3:

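import scipy.stats

# Row-wise mode; [0] selects the mode values (the second element holds the counts)
df['freq'] = scipy.stats.mode(df.values, axis=1)[0]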

   a1  a2  a3  a4  freq
0   4   4   4   4     4
1   4   4   4   4     4
2   2   3   2   3     2
3   2   3   3   3     3
4   2   2   2   2     2
5   2   2   2   2     2

Timings

df = pd.concat([df]*10000)

In [244]: %timeit df.mode(1)
12.7 s ± 268 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [245]: %timeit scipy.stats.mode(df.values, 1)[0]
10.8 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In these timings on 60,000 rows, scipy.stats.mode is roughly 1,000 times faster than df.mode(1) (10.8 ms versus 12.7 s).
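
If the tie-break toward a4 does matter, one possible vectorized approach is sketched below in plain NumPy. It is not part of the original answer and has not been timed here. The function name row_mode_prefer_last is made up for illustration, and the sketch assumes "return a4 on ties" can be read as "among the columns whose value has the highest count in the row, take the right-most one". With four columns that reading always picks a4 whenever a4's value is tied for the highest count.

```python
import numpy as np
import pandas as pd

def row_mode_prefer_last(values):
    """Row-wise most frequent value; ties go to the right-most column.

    Hypothetical helper for illustration, not a pandas or SciPy function.
    """
    # counts[i, j] = number of entries in row i equal to values[i, j]
    counts = (values[:, :, None] == values[:, None, :]).sum(axis=2)
    # argmax returns the first maximum, so search the reversed columns
    # to find the right-most column whose value has the highest count.
    n_cols = values.shape[1]
    best = n_cols - 1 - counts[:, ::-1].argmax(axis=1)
    return values[np.arange(len(values)), best]

df = pd.DataFrame({
    'a1': [4, 4, 2, 2, 2, 2],
    'a2': [4, 4, 3, 3, 2, 2],
    'a3': [4, 4, 2, 3, 2, 2],
    'a4': [4, 4, 3, 3, 2, 2],
})
df['max_freq'] = row_mode_prefer_last(df[['a1', 'a2', 'a3', 'a4']].to_numpy())
print(df)
```

On the sample data this gives 3 for the tied row (2, 3, 2, 3), matching the desired output. The intermediate comparison array has shape (rows, columns, columns), so memory grows with the square of the number of columns; for four columns and about a million rows that should stay small.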