I would like to calculate the median row by row in a dataframe of more than 500,000 rows. For the moment I'm using np.median, since NumPy is well optimized for single-core execution, but it's still very slow and I'd like to find a way to parallelize the calculation.
Specifically, I have N tables of size 13 x 500,000, and for each table I want to add the columns Q1, Q3 and MEDIAN so that for each row the MEDIAN column contains the median of that row. So I have to calculate N * 500,000 median values.
I tried with numexpr, but it doesn't seem possible.
EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't provide quartiles. Here is how I calculate the median at the moment:
    import numpy as np

    # One vectorized call: Q1, median and Q3 for every row at once
    q = np.transpose(np.percentile(data[row_array], [25, 50, 75], axis=1))
    data['Q1_' + family] = q[:, 0]
    data['MEDIAN_' + family] = q[:, 1]
    data['Q3_' + family] = q[:, 2]
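For reference, one way to parallelize this exact call is to split the rows into chunks and run np.percentile on each chunk in a separate process. This is only a sketch: the function names, the worker count and the .to_numpy() conversion are illustrative, not part of the original code.

    import numpy as np
    from multiprocessing import Pool

    def row_quartiles(chunk):
        # Q1, median and Q3 for each row of this chunk -> shape (len(chunk), 3)
        return np.percentile(chunk, [25, 50, 75], axis=1).T

    def parallel_quartiles(values, n_workers=4):  # illustrative worker count
        # Split the rows across workers, compute per chunk, then reassemble
        chunks = np.array_split(values, n_workers)
        with Pool(n_workers) as pool:
            parts = pool.map(row_quartiles, chunks)
        return np.vstack(parts)

    # q = parallel_quartiles(data[row_array].to_numpy())

Because each chunk is pickled and copied to the worker processes, this only pays off when the per-chunk work outweighs the copying, and on Windows the pool calls must sit under an if __name__ == '__main__' guard.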
EDIT 2: I solved my problem by using the median-of-medians algorithm, as proposed below.
If a (close) approximation of the median is OK for your purposes, you should consider computing a median of medians, which is a divide-and-conquer strategy that can be executed in parallel. In principle, MoM has O(n) complexity for serial execution, approaching O(1) for parallel execution on massively parallel systems.
See this Wikipedia entry for a description and pseudocode. See also this question on Stack Overflow and the discussion of the code there, and this arXiv paper for a GPU implementation.
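As a concrete illustration, here is a minimal NumPy sketch of the idea: split each row into fixed-size groups, take the median of each group, then take the median of those group medians. The group size and the decision to simply drop leftover elements are illustrative simplifications, not a production implementation.

    import numpy as np

    def median_of_medians(values, group_size=5):
        # Split the last axis into groups of `group_size`, take the median of
        # each group, then the median of those medians. The result is an
        # approximation of the true median, not the exact value.
        n = values.shape[-1] - values.shape[-1] % group_size  # drop the remainder
        groups = values[..., :n].reshape(*values.shape[:-1], -1, group_size)
        return np.median(np.median(groups, axis=-1), axis=-1)

    rng = np.random.default_rng(0)
    table = rng.normal(size=(500_000, 13))
    print(median_of_medians(table)[:3])   # approximate per-row medians
    print(np.median(table, axis=1)[:3])   # exact values, for comparison

Because the group medians are independent of one another, the inner np.median call is exactly the part that can be farmed out to separate cores or a GPU, which is where the near-O(1) parallel behaviour comes from.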