Search code examples
algorithmselectionmedianclrs

Median select algorithm - does it find the absolute median, or a "median of medians" close to the absolute median?


Section 9.3 in CLRS 3rd edition "Selection in worst-case linear time" talks about the "Select" algorithm (sometimes called the BFPRT algorithm due to Blum, Floyd, Pratt, Rivest, and Tarjan) for finding the median of a list in O(n) time at worst case. I got a little confused as I tried to run a sample on the whiteboard. I understand that a certain number of elements can be eliminated at each call to "Select" (I have read 30% are eliminated versus 70% that need to be checked again), what I was confused about was which part of the array can be eliminated, i.e. if the array is visualized as a matrix with height 5 and width n/5, then which quadrant(s) do the eliminated elements rest in? I originally thought it was two diagonally adjacent quadrants but now I am thinking it is only one quadrant depending on what the median of medians is (see steps 5, 6, and 7 here).

So I went to wikipedia to see if there was a quick explanation with less analysis than CLRS (for the sake of understanding the algorithm before I jumped back to CLRS for the analysis). I came to this, particularly 'Finally, the "median of medians' is chosen to be the pivot." from the sound of the description in wikipedia, "Select" does not find the true median rather a element that is median-enough for the purpose of choosing a pivot for quicksort.

So what does "Select" do in terms of the true median, and how does it do it? The phrase that comes to mind through all of this is "partial hierarchy", which as I understand, is the reason "Select" works, but by what logic can you eliminate elements from the list from being the median based on this partial hierarchy?


Solution

  • It finds the absolute median.

    As you said, "Select" does not find the true median rather a element that is median-enough for the purpose of choosing a pivot for quicksort. In particular it is median enough that it is guaranteed to drop at least 30% of the dataset on every iteration. Unfortunately it is also an expensive operation.

    The key idea is that the median of medians is less than or equal to 3 out of every 5 elements whose median is less than or equal to it. So it is less than or equal to 3 out of every 5 elements for half the groups of 5, so at least 30% of the set is less than or equal to it. So it is in the largest 70% of the data set.

    Similarly it is in the smallest 70% of the data set.

    This guarantees that you avoid the potential pitfall of quickselect, which is picking pivot points that have extreme values.

    If you wish to combine efficiency and a good worst case you can combine this with quickselect. For instance 4 rounds of quickselect followed by one round of this followed by 4 rounds of quickselect, etc. The expensive rounds of BFPRT guarantee O(n) while the quickselect on average is going to be fast. By putting off your first round of BFPRT until you've done several rounds of quickselect you can make the extra running time only a few percent more than quickselect on average. (The worst case cost goes up by quite a bit, but we don't expect to encounter that.)