Tags: python-3.x, sorting, numpy, indexing, median

Finding the position of the median of an array containing mostly zeros


I have a very large 1D array in which most elements are zero, while the nonzero elements are clustered in a few islands separated by long runs of zeros. Here is a smaller version as an MWE:

In [1]: import numpy as np

In [2]: A=np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,6,20,14,10,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,4,5,5,18,18,16,14,10,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,3,6,16,4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])

I want to find the median of each island and its position (even approximately), that is, the index in the original array that corresponds to the median value. Not surprisingly, taking the median of the whole array gives zero, which is not what I want:

In [3]: np.median(A)
Out[3]: 0.0

In [4]: np.argsort(A)[len(A)//2]
Out[4]: 12

In the case of a single island of nonzero elements, I work around this by removing all zeros first, since only the nonzero elements are physically meaningful, and then taking the median of the remaining elements:

In [5]: masks = np.where(A>0)
In [6]: A[masks]
Out[6]: array([ 1,  3,  6, 20, 14, 10,  5,  1])

This time I get the correct median of the new array. However, the position (index) is not correct, and as pointed out in the comments, it is also mathematically ill-defined.

In [7]: np.median(A[masks])
Out[7]: 5.5

In [8]: np.argsort(A[masks])[len(A[masks])//2]
Out[8]: 2

According to this approximation, the median sits at the third position (index 2) of the modified array. I would like to translate that back into the original array, where the position of the median should be a larger index somewhere in the middle of the first island of nonzero elements, with all the zeros counted. The comments offer two suggestions for finding the position of the median when there is a single island of nonzero elements in a sea of zeros. But what if there is more than one such island? How can I calculate the index corresponding to the median of each island in the original histogram array, where all the zeros are counted?

Is there an easy way to calculate the position of the median in such arrays of many zeros? If not, what should I add to my code to make that possible once I know the position in the modified array? Your help is greatly appreciated.


Solution

  • Based on the comment "A is actually a discrete histogram with many bins", I think what you want is the median of the values being counted. If A is an integer array of counts, then an exact formula for the median (though probably a very inefficient one if A contains values as high as 1e7, because it materializes one element per count) is

    np.median(np.repeat(np.arange(len(A)), A))  # Do not use if A contains very large values!
    

    Alternatively, you can use

    np.searchsorted(A.cumsum(), 0.5*A.sum())
    

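    which returns the index of the bin containing the lower of the two middle elements. This is the integer part of the median whenever the two middle elements fall in the same bin or in adjacent bins.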

    For example:

    In [157]: A
    Out[157]: 
    array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  3,
            6, 20, 14, 10,  5,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0])
    
    In [158]: np.median(np.repeat(np.arange(len(A)), A))
    Out[158]: 35.5
    
    In [159]: np.searchsorted(A.cumsum(), 0.5*A.sum())
    Out[159]: 35
    

    Another example:

    In [167]: B
    Out[167]: 
    array([  0,   0,   0,   1, 100,  21,   8,   3,   2,   1,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0])
    
    In [168]: np.median(np.repeat(np.arange(len(B)), B))
    Out[168]: 4.0
    
    In [169]: np.searchsorted(B.cumsum(), 0.5*B.sum())
    Out[169]: 4
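
The question also asks about arrays with more than one island. The examples above cover a single island, so what follows is only a minimal sketch of one way to extend the `searchsorted` formula: split the array at the runs of zeros, then apply the formula to each island and add the island's start index. The function name `island_median_positions` and the small test array are hypothetical, chosen here for illustration. The sketch assumes that any zero inside the array separates two islands.

```python
import numpy as np

def island_median_positions(counts):
    """Return, for each run of nonzero counts, the index in the original
    array of the bin that holds the island's (lower) median element."""
    counts = np.asarray(counts)
    nonzero = counts > 0
    # Pad with False on both sides so islands touching either end are detected.
    padded = np.concatenate(([False], nonzero, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = changes[0::2], changes[1::2]  # stops are exclusive

    positions = []
    for start, stop in zip(starts, stops):
        island = counts[start:stop]
        # Same formula as above, applied to one island only.
        offset = np.searchsorted(island.cumsum(), 0.5 * island.sum())
        positions.append(int(start + offset))
    return positions

# Hypothetical test array: two islands with known start indices (5 and 17).
counts = np.concatenate([
    np.zeros(5, dtype=int), [1, 3, 6, 20, 14, 10, 5, 1],
    np.zeros(4, dtype=int), [2, 4, 5, 5, 18, 18, 16, 14, 10, 2, 1],
    np.zeros(3, dtype=int),
])
positions = island_median_positions(counts)
print(positions)  # [8, 22]
```

Because each island's offset is added to its start index, the returned positions are already expressed in the original array, so no separate step is needed to map them back from a zero-stripped copy.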