Search code examples
pythonalgorithmnumpymedian

how does numpy find the median in an array/list?


I read, that numpy uses introselect to find the median in an array/ list (https://www.researchgate.net/publication/303755458_Fast_Deterministic_Selection) [page 2; last 5 lines]. But I couldn't find any hints for that in the numpy source code: https://github.com/numpy/numpy/blob/v1.19.0/numpy/lib/function_base.py#L3438-L3525

Does anyone know where I could find the numpy implementation of introselect? Or if numpy doesn't use introselect, what kind of algorithm do the use to find the median?

Many thanks in advance :)


Solution

  • In line 3528 seems to be the main median function. If we cut out all the multidimensional and nan stuff we get something like

    def _median(a, axis=None, out=None, overwrite_input=False):
        # can't be reasonably be implemented in terms of percentile as we have to
        # call mean to not break astropy
    
        # Set the partition indexes
        sz = a.shape
        if sz % 2 == 0:
            szh = sz // 2
            kth = [szh - 1, szh]
        else:
            kth = [(sz - 1) // 2]
    
        part = partition(a, kth, axis=None)
    
        return mean(part[indexer], axis=None, out=out)
    

    So partition is doing all the work and comes from

    from numpy.core.fromnumeric import (
        ravel, nonzero, partition, mean, any, sum
        )
    

    If we go to the numpy code we get to the following C code.

    NPY_SELECTKIND sortkind = NPY_INTROSELECT;
    

    and

    val = PyArray_Partition(self, ktharray, axis, sortkind);
    

    Which implemented here and uses

    mid = ll + median_of_median5_@suff@(v + ll, hh - ll, NULL, NULL);
    

    So it is introselect.

    Once twice the recursion depth is reached the algorithm change to use meadian-of-median5 until the partition is less than 5.