
python/numpy fastest method for 2d kernel rank filtering on masked arrays (and/or selective ranking)


Given a 2D numpy array

MyArray = np.array([[ 8.02,  9.54,  0.82,  7.56,  2.26,  9.47],
           [ 2.68,  7.3 ,  2.74,  3.03,  2.25,  8.84],
           [ 2.21,  3.62,  0.55,  2.94,  5.77,  0.21],
           [ 5.78,  5.72,  8.85,  0.24,  5.37,  9.9 ],
           [ 9.1 ,  7.21,  4.14,  9.95,  6.73,  6.08],
           [ 1.8 ,  5.14,  5.02,  6.52,  0.3 ,  6.11]])

and a mask array

MyMask =  np.array([[ 0.,  0.,  1.,  1.,  0.,  1.],
            [ 1.,  0.,  0.,  0.,  0.,  1.],
            [ 0.,  0.,  0.,  1.,  0.,  0.],
            [ 0.,  1.,  1.,  1.,  1.,  0.],
            [ 0.,  1.,  0.,  1.,  0.,  0.],
            [ 0.,  1.,  0.,  0.,  1.,  1.]])

I wish to run a 'holey' median filter that ignores masked elements.

For example, a rank filter with a kernel

k = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

would run on MyArray: for each element, sort the neighbourhood defined by the kernel and return the median of only the non-masked elements (averaging the two middle values when the number of non-masked elements is even).
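For instance, taking the 3×3 neighbourhood of MyArray[1, 1] and the matching slice of MyMask, the masked ring median can be computed by hand (a minimal sketch; `patch` and `pmask` are simply those sub-arrays copied out):

```python
import numpy as np

# 3x3 neighbourhood of MyArray[1, 1] and the matching slice of MyMask
patch = np.array([[8.02, 9.54, 0.82],
                  [2.68, 7.30, 2.74],
                  [2.21, 3.62, 0.55]])
pmask = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [0, 0, 0]], bool)

# ring kernel: 8 neighbours, centre excluded
k = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], bool)

vals = patch[k & ~pmask]   # ring elements that are not masked
med = np.median(vals)      # median of the 6 kept values, approximately 3.18
```

Six values survive (8.02, 9.54, 2.74, 2.21, 3.62, 0.55); since the count is even, the median averages the middle pair 2.74 and 3.62, giving 3.18, which matches the expected output below.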

Currently I am doing this with unpythonic loops, mapping the mask to NaNs and using bottleneck.nanmedian. This gives me exactly what I need, but I was hoping to rely on 2D array manipulation routines.

scipy.signal.order_filter and scipy.ndimage.filters.rank_filter are both available (rank_filter appears to be much faster), but both sort NaN and Inf to the top of the neighbourhood before returning the rank, which biases the result. Neither method appears to support numpy.ma arrays, nor does either accept an array of per-element ranks (which would let me fill all masked elements with 0 and offset the rank), nor is there an obvious way to vary the kernel per location.

I am wondering if I have missed a combination and/or python feature, or if I should be looking to implement a new routine in Cython.

Ignoring boundary handling, the internal points of the above problem would be

[[ 0.     0.     0.     0.     0.     0.   ]
 [ 0.     3.18   3.62   2.26   2.645  0.   ]
 [ 0.     2.74   3.325  2.74   2.64   0.   ]
 [ 0.     3.88   3.62   4.955  6.08   0.   ]
 [ 0.     5.02   5.77   5.77   6.52   0.   ]
 [ 0.     0.     0.     0.     0.     0.   ]]
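For reference, the table above can be reproduced with a plain double loop, mapping the mask to NaNs and calling np.nanmedian (standing in here for bottleneck.nanmedian; `holey_median` is just an illustrative name for the slow approach I am trying to replace):

```python
import numpy as np

MyArray = np.array([[8.02, 9.54, 0.82, 7.56, 2.26, 9.47],
                    [2.68, 7.30, 2.74, 3.03, 2.25, 8.84],
                    [2.21, 3.62, 0.55, 2.94, 5.77, 0.21],
                    [5.78, 5.72, 8.85, 0.24, 5.37, 9.90],
                    [9.10, 7.21, 4.14, 9.95, 6.73, 6.08],
                    [1.80, 5.14, 5.02, 6.52, 0.30, 6.11]])

MyMask = np.array([[0, 0, 1, 1, 0, 1],
                   [1, 0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0, 0],
                   [0, 1, 1, 1, 1, 0],
                   [0, 1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1, 1]], bool)

def holey_median(A, mask):
    A = A.astype(float).copy()
    A[mask] = np.nan                    # masked entries are ignored by nanmedian
    k = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]], bool)     # ring kernel, centre excluded
    out = np.zeros_like(A)              # boundary is left at 0, as above
    for i in range(1, A.shape[0] - 1):
        for j in range(1, A.shape[1] - 1):
            out[i, j] = np.nanmedian(A[i-1:i+2, j-1:j+2][k])
    return out

print(holey_median(MyArray, MyMask))
```

The printed interior values match the table of internal points listed above.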

Solution

  • One way is to sacrifice RAM usage to forego the Python loops: blow up the original array so that the filter can be applied to all sub-arrays at once, which is similar in spirit to NumPy broadcasting.

    For an array of 1000x1000 the vectorized function performs roughly 100x faster, in my testing.

    In my code I used NaNs for masking, but with a few extra lines you could also use numpy.ma arrays. I didn't have a nanmedian function available, so I used nanmean; performance should be comparable, though.

    import numpy as np
    from numpy.lib.stride_tricks import as_strided
    
    # test data
    N = 1000
    A = np.random.rand(N, N)*10
    mask = np.random.choice([True, False], size=(N, N))
    
    def filter_loop(A, mask):
        # boolean ring kernel: 8 neighbours, centre excluded
        kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], bool)
        A = A.copy()
        A[mask] = np.nan          # masked elements are ignored by nanmean
        N = A.shape[0] - 2        # assuming a square matrix
        out = np.empty((N, N))
        for i in range(N):
            for j in range(N):
                out[i, j] = np.nanmean(A[i:i+3, j:j+3][kernel])
        return out
    
    def filter_broadcast(A, mask):
        A = A.copy()
        A[mask] = np.nan
        N = A.shape[0] - 2
        # view A as an (N, N) grid of overlapping 3x3 windows
        B = as_strided(A, (N, N, 3, 3), A.strides + A.strides)
        B = B.copy().reshape((N, N, 3*3))
        B[:, :, 4] = np.nan       # knock out the centre element (the kernel hole)
        return np.nanmean(B, axis=2)
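Since the question asks for a median specifically, the same strided trick works with np.nanmedian (available since NumPy 1.9) in place of nanmean. A sketch, with `filter_broadcast_median` as an illustrative name, checked against the 6×6 example from the question:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def filter_broadcast_median(A, mask):
    A = A.astype(float).copy()
    A[mask] = np.nan
    N = A.shape[0] - 2                  # assuming a square array
    # overlapping 3x3 windows as an (N, N, 3, 3) view, flattened per window
    B = as_strided(A, (N, N, 3, 3), A.strides + A.strides)
    B = B.copy().reshape((N, N, 9))
    B[:, :, 4] = np.nan                 # kernel hole: drop the centre element
    return np.nanmedian(B, axis=2)

# the 6x6 example from the question; returns the 4x4 interior medians
A = np.array([[8.02, 9.54, 0.82, 7.56, 2.26, 9.47],
              [2.68, 7.30, 2.74, 3.03, 2.25, 8.84],
              [2.21, 3.62, 0.55, 2.94, 5.77, 0.21],
              [5.78, 5.72, 8.85, 0.24, 5.37, 9.90],
              [9.10, 7.21, 4.14, 9.95, 6.73, 6.08],
              [1.80, 5.14, 5.02, 6.52, 0.30, 6.11]])
M = np.array([[0, 0, 1, 1, 0, 1],
              [1, 0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0, 0],
              [0, 1, 1, 1, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1, 1]], bool)
print(filter_broadcast_median(A, M))
```

One caveat: if every ring element of some window is masked, nanmedian emits an all-NaN-slice warning and returns NaN for that element, so fully masked neighbourhoods need separate handling.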