Given a 2D numpy array
MyArray = np.array([[ 8.02,  9.54,  0.82,  7.56,  2.26,  9.47],
                    [ 2.68,  7.3 ,  2.74,  3.03,  2.25,  8.84],
                    [ 2.21,  3.62,  0.55,  2.94,  5.77,  0.21],
                    [ 5.78,  5.72,  8.85,  0.24,  5.37,  9.9 ],
                    [ 9.1 ,  7.21,  4.14,  9.95,  6.73,  6.08],
                    [ 1.8 ,  5.14,  5.02,  6.52,  0.3 ,  6.11]])
and a mask array
MyMask = np.array([[ 0.,  0.,  1.,  1.,  0.,  1.],
                   [ 1.,  0.,  0.,  0.,  0.,  1.],
                   [ 0.,  0.,  0.,  1.,  0.,  0.],
                   [ 0.,  1.,  1.,  1.,  1.,  0.],
                   [ 0.,  1.,  0.,  1.,  0.,  0.],
                   [ 0.,  1.,  0.,  0.,  1.,  1.]])
I wish to run a 'holey' median filter that ignores masked elements.
For example, a rank filter with a kernel
k = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
would run on MyArray, sorting the neighbourhood defined by the kernel for each element and returning the median of only the non-masked elements (averaging the two central values when the number of unmasked elements is even).
Currently I am doing this with unpythonic loops, mapping the mask to NaNs and using bottleneck.nanmedian. That gives me exactly what I need, but I was hoping to rely on 2D array manipulation routines instead.
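For reference, a minimal sketch of that loop-based approach (the function name holey_median_loop is just for illustration; it ignores the one-element border and assumes the boolean ring kernel above):

import numpy as np
import bottleneck as bn

def holey_median_loop(A, mask, kernel):
    # map masked elements to NaN so bottleneck.nanmedian skips them
    A = np.where(mask.astype(bool), np.nan, A)
    out = np.zeros_like(A)
    for i in range(1, A.shape[0] - 1):
        for j in range(1, A.shape[1] - 1):
            window = A[i-1:i+2, j-1:j+2]
            # kernel selects the 8 ring elements; nanmedian averages the
            # two central values when the unmasked count is even
            out[i, j] = bn.nanmedian(window[kernel.astype(bool)])
    return out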
scipy.signal.order_filter and scipy.ndimage.filters.rank_filter are both available (rank_filter appears to be much faster), but it appears they sort NaN and Inf to the top of the neighbourhood before extracting the rank, biasing the result. It seems neither method supports numpy.ma arrays (masking), nor do they accept an array of per-element ranks (then I could fill all masked positions with 0 and offset my rank accordingly), nor is there an obvious way to vary the kernel at each location.
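For example, the closest single-rank approximation would be something like the snippet below; how rank_filter orders NaNs is not documented behaviour, so this mainly illustrates why a fixed rank cannot compensate for a varying number of masked neighbours:

import numpy as np
from scipy.ndimage import rank_filter

k = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=bool)

A = MyArray.copy()
A[MyMask.astype(bool)] = np.nan

# rank=3 picks the 4th-smallest of the 8 ring elements; with NaNs
# sorted to the high end, any neighbourhood containing masked
# elements has its result biased towards the low values
biased = rank_filter(A, rank=3, footprint=k)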
I am wondering if I have missed a combination of these, or some Python feature, or whether I should be looking to implement a new routine in Cython.
Ignoring boundary handling, the internal points of the above problem would be
[[ 0. 0. 0. 0. 0. 0. ]
[ 0. 3.18 3.62 2.26 2.645 0. ]
[ 0. 2.74 3.325 2.74 2.64 0. ]
[ 0. 3.88 3.62 4.955 6.08 0. ]
[ 0. 5.02 5.77 5.77 6.52 0. ]
[ 0. 0. 0. 0. 0. 0. ]]
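As a sanity check on one entry: the unmasked neighbours of element (1, 1) are 8.02, 9.54, 2.74, 2.21, 3.62 and 0.55, and their median reproduces the 3.18 above:

import numpy as np

ring = np.array([8.02, 9.54, 2.74, 2.21, 3.62, 0.55])  # unmasked neighbours of (1, 1)
print(np.median(ring))  # (2.74 + 3.62) / 2 = 3.18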
One way is to sacrifice RAM usage to forego the Python loops: we blow up the original array into a stack of overlapping sub-arrays so the filter can be applied to all of them at once, which is similar in spirit to NumPy broadcasting. For a 1000x1000 array, the vectorized function is roughly 100x faster in my testing.
In my code I used NaNs for masking, but with a few extra lines of code you could also use numpy.ma arrays. I didn't have a nanmedian function available, so I used nanmean instead; performance should be comparable though.
import numpy as np
from numpy.lib.stride_tricks import as_strided

# test data
N = 1000
A = np.random.rand(N, N) * 10
mask = np.random.choice([True, False], size=(N, N))

def filter_loop(A, mask):
    # boolean ring kernel: every neighbour except the centre
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], bool)
    A = A.copy()
    A[mask] = np.nan
    N = A.shape[0] - 2  # assuming square matrix
    out = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            out[i, j] = np.nanmean(A[i:i+3, j:j+3][kernel])
    return out

def filter_broadcast(A, mask):
    A = A.copy()
    A[mask] = np.nan
    N = A.shape[0] - 2
    # view A as an (N, N) grid of overlapping 3x3 windows
    B = as_strided(A, (N, N, 3, 3), A.strides + A.strides)
    B = B.copy().reshape((N, N, 3*3))
    B[:, :, 4] = np.nan  # knock out the centre of each window
    return np.nanmean(B, axis=2)
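A quick way to check agreement and reproduce the timing comparison (the function names are the ones defined above; exact speed-ups will vary by machine):

import timeit

np.testing.assert_allclose(filter_loop(A, mask), filter_broadcast(A, mask))

t_loop = timeit.timeit(lambda: filter_loop(A, mask), number=1)
t_fast = timeit.timeit(lambda: filter_broadcast(A, mask), number=1)
print(t_loop / t_fast)  # roughly 100 on a 1000x1000 array in my testing

With NumPy 1.9 or later, np.nanmedian is available, so both functions can compute the true holey median simply by swapping nanmean for nanmedian.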