Tags: python, numpy, large-data, median

Faster median in very large numpy arrays


I have a very large numpy array of shape (4000, 6000, 15).

I now want the median of each stack, i.e. the median along the third dimension. My current code works, but it is curiously slow: the median for a single stack [0,0,:] (15 values) takes around half a second to complete.

import numpy as np

height = 4000
width = 6000
N = 15

poolmedian = np.zeros((height, width, 3))
RGBmedian = np.zeros((height, width, N), dtype=float)

for n in range(height):
    for m in range(width):
        poolmedian[n, m, 0] = np.median(RGBmedian[n, m, :])

Solution

  • You'll want to vectorize the median computation as much as possible. Every time you call a numpy function, you take a hit crossing between the Python and C layers, so do as much per call in the C layer as possible:

    import numpy as np
    height = 40
    width = 60
    N = 15
    
    np.random.seed(1)
    poolmedian = np.zeros((height,width,3))
    RGBmedian = np.random.random((height,width,N))
    
    def original():
        for n in range(0,height):
            for m in range(0,width):
                poolmedian[n,m,0] = np.median(RGBmedian[n,m,:])
        return poolmedian
    
    def vectorized():
        # Note: np.median is only called ONCE, not n*m times.
        poolmedian[:, :, 0] = np.median(RGBmedian, axis=-1)
        return poolmedian
    
    
    orig = original().copy()  # copy needed: both functions write into the same poolmedian buffer
    vec = vectorized()
    
    np.testing.assert_array_equal(orig, vec)
    

    You can see that the values are the same since the assert passes (although it's not clear why you need 3 dims in poolmedian). I put the above code in a file called test.py and am using IPython for its convenient %timeit. I also toned the size down a bit just so it runs faster, but you should get similar savings on your large data. The vectorized version is about 100x faster:

    In [1]: from test import original, vectorized
    
    In [2]: %timeit original()
    69.1 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [3]: %timeit vectorized()
    618 µs ± 4.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    In general, you want to use numpy's broadcasting rules and call a function as few times as possible; calling functions in a Python-level loop is almost always a no-no if you're looking for performant numpy code. A small broadcasting sketch follows below.
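    To illustrate (a minimal sketch with illustrative names, not code from the question): np.median with keepdims=True keeps the reduced axis, so its (height, width, 1) result broadcasts against the (height, width, N) array in a single expression, with no Python loops:

    import numpy as np
    
    rgb = np.random.random((40, 60, 15))
    # One call computes every stack's median; keepdims=True yields shape
    # (40, 60, 1), which broadcasts against (40, 60, 15) below.
    med = np.median(rgb, axis=-1, keepdims=True)
    centered = rgb - med   # broadcast subtraction, still no Python loops
    assert centered.shape == rgb.shape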

    Addendum:

    I've added the following function to test.py. Since there is another answer, I want to make it clear that it's faster to call a fully vectorized version (i.e. no loops); I've also modified the code to use dims 4000 by 6000:

    import numpy as np
    height = 4000
    width = 6000
    N = 15
    
    ...
    
    def fordy():
        for n in range(height):
            for m in range(width):
                # Copy the stack: slicing gives a view, and sort() works
                # in place, so sorting the view would mutate RGBmedian.
                array = RGBmedian[n, m, :].copy()
                array.sort()
                # N = 15 is odd, so the median is the middle element.
                poolmedian[n, m, 0] = array[7]
        return poolmedian
    

    and if we load all of this into IPython, we get:

    In [1]: from test import original, fordy, vectorized
    
    In [2]: %timeit original()
    6.87 s ± 72.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [3]: %timeit fordy()
    262 ms ± 737 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [4]: %timeit vectorized()
    18.4 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

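    As an aside (a hedged sketch, not one of the timed functions above): the sort-the-stack idea in fordy can itself be vectorized with np.partition, which partially sorts along an axis just far enough to put the element at a given index into its sorted position; for odd N, that middle element is the median:

    import numpy as np
    
    N = 15
    rgb = np.random.random((40, 60, N))
    # Partially sort each stack so index N // 2 holds its sorted value,
    # then take that middle element as the median (valid because N is odd).
    med = np.partition(rgb, N // 2, axis=-1)[..., N // 2]
    np.testing.assert_array_equal(med, np.median(rgb, axis=-1))

    Note that np.median already uses a selection algorithm internally, so treat this as a vectorized restatement of the manual approach rather than a guaranteed extra speedup.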
    HTH.