Assuming I have 2 boolean arrays of same size, is_overlap
and is_incomplete
.
I would like to retrieve the start and end indices (end indices excluded) of regions of interest in these arrays (these regions are of interest for an analysis achieved in subsequent steps of the algo - start and end indices excluded will be used to slice a 3rd array of same size).
To identify these regions, this set of conditions need to be fulfilled.
True
row in is_overlap
,False
or True
in is_overlap
)
True
rows in is_incomplete
,max_n_rows
is_incomplete
rows, and also contiguous regions (regions are merged together)Example
import numpy as np
max_n_rows = 4
# row indices 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
is_overlap = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=bool)
Result expected is then,
# region indices 0 1 2 3 4
region_of_interest_starts = np.array([1, 3, 9, 14, 16])
region_of_interest_ends_excl = np.array([2, 8, 13, 15, 17])
Explanations:
region 0:
is_overlap[1] = True
is_incomplete[0] = True
, because number of rows in this region would then be 2, which is lower than max_n_rows
.region 1:
is_overlap[4] = is_overlap[5] = True
is_incomplete[3] = is_incomplete[6] = is_incomplete[7] = True
) and total number of rows in this region is then 5, which is larger or equal than max_n_rows
.region 2:
True
in is_overlap
at rows 9 and 11.is_incomplete[10] = True
. The total number of rows of this region is then 4 (made by rows 9, 10, 11, 12). Because it is larger than or equal to max_n_rows
, merging is achieved.regions 3 and 4:
max_n_rows
. Merging with contiguous rows in is_incomplete
is then not achieved. Only condition 1/ applies, resulting in 2 separate regions.Here is my attempt as an answer. I am frankly not sure this is optimal, and the list of intervals returned at the end is unordered.
import numpy as np
def get_indices_region(mask: np.ndarray) -> np.ndarray:
"""Computes the start and end indices of each connected components in `mask`.
Taken from https://stackoverflow.com/questions/68514880/finding-contiguous-regions-in-a-1d-boolean-array.
Args:
mask (np.ndarray): A 1d numpy array of dtype `bool`.
Returns:
A numpy array containing the list of the start and end indices of each connected components in `mask`.
"""
return np.flatnonzero(np.diff(np.r_[np.int8(0), mask.astype(np.int8), np.int8(0)])).reshape(-1, 2)
def your_function(is_overlap: np.ndarray, is_incomplete: np.ndarray, max_n_rows: int = 4) -> np.ndarray:
# Step 1: compute the enlarged candidate regions
indices_enlarged = get_indices_region(is_overlap | is_incomplete)
# Step 2: Filter out enlarged candidates based on their length
lengths_enlarged = indices_enlarged[:, 1] - indices_enlarged[:, 0]
indices_enlarged = indices_enlarged[lengths_enlarged >= max_n_rows]
# Step 3: compute the reduced candidate regions
indices_reduced = get_indices_region(is_overlap)
# Step 4: Filter out reduced candidates based on their overlap with the enlarged regions
mask_overlap = np.any(
(indices_enlarged[:, None, 0] <= indices_reduced[None, :, 0])
& (indices_enlarged[:, None, 1] >= indices_reduced[None, :, 1]),
axis=0
)
indices_reduced = indices_reduced[~mask_overlap]
return np.vstack((indices_reduced, indices_enlarged))
For your example, we get:
max_n_rows = 4
is_overlap = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
your_function(is_overlap, is_incomplete, max_n_rows)
>>> [[1 2]
[14 15]
[16 17]
[3 8]
[9 13]]