Search code examples
pythonnumpy

Numpy: vectorizing the comparison of 2 boolean arrays to derive start and end indices of regions of interest


Assuming I have 2 boolean arrays of same size, is_overlap and is_incomplete. I would like to retrieve the start and end indices (end indices excluded) of regions of interest in these arrays (these regions are of interest for an analysis achieved in subsequent steps of the algo - start and end indices excluded will be used to slice a 3rd array of same size).

To identify these regions, this set of conditions need to be fulfilled.

  • 1/ a region of interest contains at least one True row in is_overlap,
  • 2/ a region of interest may further extend to contiguous rows (be it False or True in is_overlap)
    • if they are contiguous True rows in is_incomplete,
    • if the total number of rows in the extended region is larger than or equal to max_n_rows
    • the extended region is the one accounting for contiguous is_incomplete rows, and also contiguous regions (regions are merged together)

Example

import numpy as np

max_n_rows    = 4
# row indices             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
is_overlap    = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=bool)

Result expected is then,

# region indices                         0  1   2   3   4
region_of_interest_starts    = np.array([1, 3,  9, 14, 16])
region_of_interest_ends_excl = np.array([2, 8, 13, 15, 17])

Explanations:

  • region 0:

    • condition 1/ region contains the row 1 because is_overlap[1] = True
    • condition 2/ is not achieved. Region does not extend to row 0 even if is_incomplete[0] = True, because number of rows in this region would then be 2, which is lower than max_n_rows.
  • region 1:

    • condition 1/ region contains the 4 and 5 because is_overlap[4] = is_overlap[5] = True
    • condition 2/ is achieved. Region extends to contiguous rows 3, 6 and 7 because is_incomplete[3] = is_incomplete[6] = is_incomplete[7] = True) and total number of rows in this region is then 5, which is larger or equal than max_n_rows.
  • region 2:

    • condition 1/ region 2 is actually made of 2 separate True in is_overlap at rows 9 and 11.
    • condition 2/ these 2 separate rows are however linked by the contiguous is_incomplete[10] = True. The total number of rows of this region is then 4 (made by rows 9, 10, 11, 12). Because it is larger than or equal to max_n_rows, merging is achieved.
  • regions 3 and 4:

    • start is similar to region 2. However, following the same logic, the total number of rows would then be 3 (made by rows 14, 15, 16) which is lower than max_n_rows. Merging with contiguous rows in is_incomplete is then not achieved. Only condition 1/ applies, resulting in 2 separate regions.

Solution

  • Here is my attempt as an answer. I am frankly not sure this is optimal, and the list of intervals returned at the end is unordered.

    import numpy as np
    
    
    def get_indices_region(mask: np.ndarray) -> np.ndarray:
        """Computes the start and end indices of each connected components in `mask`.
        Taken from https://stackoverflow.com/questions/68514880/finding-contiguous-regions-in-a-1d-boolean-array.
    
        Args:
            mask (np.ndarray): A 1d numpy array of dtype `bool`.
    
        Returns:
            A numpy array containing the list of the start and end indices of each connected components in `mask`.
        """
    
        return np.flatnonzero(np.diff(np.r_[np.int8(0), mask.astype(np.int8), np.int8(0)])).reshape(-1, 2)
    
    
    def your_function(is_overlap: np.ndarray, is_incomplete: np.ndarray, max_n_rows: int = 4) -> np.ndarray:
        # Step 1: compute the enlarged candidate regions
        indices_enlarged = get_indices_region(is_overlap | is_incomplete)
    
        # Step 2: Filter out enlarged candidates based on their length
        lengths_enlarged = indices_enlarged[:, 1] - indices_enlarged[:, 0]
        indices_enlarged = indices_enlarged[lengths_enlarged >= max_n_rows]
    
        # Step 3: compute the reduced candidate regions
        indices_reduced = get_indices_region(is_overlap)
    
        # Step 4: Filter out reduced candidates based on their overlap with the enlarged regions
        mask_overlap = np.any(
            (indices_enlarged[:, None, 0] <= indices_reduced[None, :, 0])
            & (indices_enlarged[:, None, 1] >= indices_reduced[None, :, 1]),
            axis=0
        )
        indices_reduced = indices_reduced[~mask_overlap]
    
        return np.vstack((indices_reduced, indices_enlarged))
    

    For your example, we get:

    max_n_rows = 4
    is_overlap = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
    is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
    
    your_function(is_overlap, is_incomplete, max_n_rows)
    
    >>> [[1   2]
         [14 15]
         [16 17]
         [3   8]
         [9  13]]