I have a dataframe like this:
a1 a2 a3 a4
4 4 4 4
4 4 4 4
2 3 2 3
2 3 3 3
2 2 2 2
2 2 2 2
Desired output:
a1 a2 a3 a4 max_freq
4 4 4 4 4
4 4 4 4 4
2 3 2 3 3
2 3 3 3 3
2 2 2 2 2
2 2 2 2 2
For each row, I want to return the value that occurs most often across columns a1, a2, a3, a4. For example, in the first row 4 occurs 4 times, so max_freq = 4, and so on. In case of a tie, return the value in a4.
I started off with something like:
import numpy as np

def get_max_freq(row):
    # Count how often each distinct value occurs in the row
    unique, counts = np.unique(np.array(row), return_counts=True)
    print(unique, counts)

df_temp.apply(get_max_freq, axis=1)
I am able to get the frequency of items row-wise. I could go on and convert them into a dataframe, sort by count, select the first element, and return it from the function, but that seems slow. Is there a Pythonic way to solve this that is fast enough for a dataframe of ~1M rows?
If you're concerned about speed and don't care about the a4 tie-breaking constraint, as you mentioned in the comments, you can use scipy.stats.mode:
df['freq'] = scipy.stats.mode(df.values, 1)[0]
a1 a2 a3 a4 freq
0 4 4 4 4 4
1 4 4 4 4 4
2 2 3 2 3 2
3 2 3 3 3 3
4 2 2 2 2 2
5 2 2 2 2 2
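For anyone who wants to reproduce the table above, here is a minimal self-contained sketch. The sample frame is rebuilt from the question; the `ravel()` call is my own addition, on the assumption that the shape of the returned modes can differ between SciPy versions (one value per row either as a flat array or as a single column). Note that row 2, where 2 and 3 are tied, comes out as 2 rather than the a4 value 3.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Sample data from the question
df = pd.DataFrame(
    [[4, 4, 4, 4], [4, 4, 4, 4], [2, 3, 2, 3],
     [2, 3, 3, 3], [2, 2, 2, 2], [2, 2, 2, 2]],
    columns=["a1", "a2", "a3", "a4"],
)

# [0] selects the modes (the second field of the result holds the counts).
modes = scipy.stats.mode(df.to_numpy(), axis=1)[0]

# Flatten so the assignment works whether the modes come back
# with shape (n,) or (n, 1).
df["freq"] = np.asarray(modes).ravel()
print(df)
```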
Timings
df = pd.concat([df]*10000)
In [244]: %timeit df.mode(1)
12.7 s ± 268 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [245]: %timeit scipy.stats.mode(df.values, 1)[0]
10.8 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This gives you a massive performance boost over df.mode(1): roughly 1,000x in the timings above (10.8 ms versus 12.7 s).
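If the a4 tie-breaking rule from the question does matter, one possible vectorized approach is sketched below in plain NumPy. This is not part of the scipy.stats.mode solution above, and the function name `row_mode_prefer_last` is made up for illustration. It counts, for every cell, how often that cell's value occurs in its row, then returns the last column's value whenever that value reaches the row's maximum count. With four columns, any tie always includes the a4 value; for other widths the fallback (the first column that reaches the maximum count) is my own assumption about what is wanted.

```python
import numpy as np

def row_mode_prefer_last(values):
    """Row-wise mode of a 2-D array, preferring the last column on ties."""
    values = np.asarray(values)
    # counts[i, j] = how many times values[i, j] occurs in row i.
    # The intermediate boolean array has shape (n_rows, n_cols, n_cols).
    counts = (values[:, :, None] == values[:, None, :]).sum(axis=2)
    max_count = counts.max(axis=1)
    # Fallback: value in the first column that reaches the maximum count.
    first_mode = values[np.arange(len(values)), counts.argmax(axis=1)]
    # Prefer the last column whenever its value is one of the modes.
    last_is_mode = counts[:, -1] == max_count
    return np.where(last_is_mode, values[:, -1], first_mode)

# Sample data from the question
sample = np.array([[4, 4, 4, 4], [4, 4, 4, 4], [2, 3, 2, 3],
                   [2, 3, 3, 3], [2, 2, 2, 2], [2, 2, 2, 2]])
print(row_mode_prefer_last(sample))  # [4 4 3 3 2 2]
```

On a dataframe this could be applied as `df['max_freq'] = row_mode_prefer_last(df[['a1', 'a2', 'a3', 'a4']].to_numpy())`. The intermediate comparison array grows with the square of the number of columns, so this only makes sense for narrow frames like the one here; I have not timed it against the options above.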