Search code examples
parallel-processingmedian

Parallel computation of the median of a large array


I got asked this question once and still haven't been able to figure it out:

You have an array of N integers, where N is large, say, a billion. You want to calculate the median value of this array. Assume you have m+1 machines (m workers, one master) to distribute the job to. How would you go about doing this?

Since the median is a nonlinear operator, you can't just find the median in each machine and then take the median of those values.


Solution

  • Depending on the Parallel Computation Model, algorithms could vary. (Note: the pdf linked to in previous sentence just contains some of the many possible ones).

    Finding the median is a special case of finding the ith element. This problem is called 'selection problem', so you need to search the web for parallel selection.

    Here is one paper (unfortunately, not free) which might be useful: Parallel Selection Algorithms With Analysis on Clusters.

    And google's first link for the query "Parallel Selection" gives: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html which actually uses the median of medians for the general problem and not just median finding.