Search code examples
pythonarraysnumpyrasterndimage

Select size filtered elements in a large array (raster)


I could need help on this:

On a large boolean numpy array (imported raster) (2000x2000), I try to select only elements that are greater than 800 units. (number of total elements > 1000)

I tried a loop:

labeled_array, num_features = scipy.ndimage.label(my_np_array, structure = None, output = np.int)

print num_features

RasterYSize, RasterXSize = my_np_array.shape
big_zones = np.zeros((RasterYSize, RasterXSize), dtype=np.bool)

print "Iterating in progress"
# Improvement could be needed to reduce the number of loops
for i in range(1, num_features):
    zone_array = (labeled_array == i)
    zone = np.sum(zone_array)
    if zone > 800:
        big_zones += zone_array

But I'm sure there's a better way to do this.


Solution

  • Here's a vectorized approach based on binning with np.bincount and performing the cumulative ORing at big_zones += zone_array with np.in1d -

    from scipy.ndimage import label
    
    # Label with scipy
    labeled_array, num_features = label(my_np_array, structure = None, output = np.int)
    
    # Set the threshold
    thresh = 800
    
    # Get the binned counts with "np.bincount" and check against threshold
    matches = np.bincount(labeled_array.ravel())>thresh
    
    # Get the IDs corresponding to matches and get rid of the starting "0" and 
    # "num_features", as you won't have those in "range(1, num_features)" either
    match_feat_ID = np.nonzero(matches)[0]
    valid_match_feat_ID = np.setdiff1d(match_feat_ID,[0,num_features])
    
    # Finally, use "np.in1d" to do ORing operation corresponding to the iterative 
    # "big_zones += zone_array" operation on the boolean array "big_zones". 
    # Since "np.in1d"  works with 1D arrays only, reshape back to 2D shape
    out = np.in1d(labeled_array,valid_match_feat_ID).reshape(labeled_array.shape)
    

    Runtime tests and verify outputs

    Function definitions -

    def original_app(labeled_array,num_features,thresh):
        big_zones = np.zeros((my_np_array.shape), dtype=np.bool)
        for i in range(1, num_features):
            zone_array = (labeled_array == i)
            zone = np.sum(zone_array)
            if zone > thresh:
                big_zones += zone_array
        return big_zones
    
    def vectorized_app(labeled_array,num_features,thresh):
        matches = np.bincount(labeled_array.ravel())>thresh
        match_feat_ID = np.nonzero(matches)[0]
        valid_match_feat_ID = np.setdiff1d(match_feat_ID,[0,num_features])
        return np.in1d(labeled_array,valid_match_feat_ID).reshape(labeled_array.shape)
    

    Timings and output verification -

    In [2]: # Inputs
       ...: my_np_array = np.random.rand(200,200)>0.5
       ...: labeled_array, num_features = label(my_np_array, structure = None, output = np.int)
       ...: thresh = 80
       ...: 
    
    In [3]: out1 = original_app(labeled_array,num_features,thresh)
    
    In [4]: out2 = vectorized_app(labeled_array,num_features,thresh)
    
    In [5]: np.allclose(out1,out2)
    Out[5]: True
    
    In [6]: %timeit original_app(labeled_array,num_features,thresh)
    1 loops, best of 3: 407 ms per loop
    
    In [7]: %timeit vectorized_app(labeled_array,num_features,thresh)
    100 loops, best of 3: 2.5 ms per loop