How can I replace multiple instances of a pattern in a large 2D numpy array with a smaller 2D numpy array?
I am looking for a vectorized solution using a boolean mask to minimize the performance impact because the large array I'm working will be millions of rows long.
For example:
#Large array
largeArr = np.array([
[0, 1, 1],
[0, 1, 1],
[0, 1, 1],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 1, 1],
[0, 1, 1],
[0, 1, 1],
[0, 0, 0],
[0, 0, 0],
[3, 2, 0],
[3, 2, 0],
[3, 2, 0],
[3, 2, 0],
[0, 0, 0],
[0, 0, 0],
[3, 2, 0],
[3, 2, 0],
[3, 2, 0],
[3, 2, 0],
[0, 0, 0]
])
I would like to replace sections with 3 consecutive rows containing [0, 1, 1]
with
pattern1 = [
[0, 2, 1],
[0, 2, 2],
[0, 2, 3]
]
Then I would like to replace sections with 4 consecutive rows containing [3, 2, 0]
with
pattern2 = [
[5, 2, 1],
[5, 3, 2],
[5, 4, 3],
[5, 5, 4]
]
Expected result:
[[0, 2, 1],
[0, 2, 2],
[0, 2, 3],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 2, 1],
[0, 2, 2],
[0, 2, 3],
[0, 0, 0],
[0, 0, 0],
[5, 2, 1],
[5, 3, 2],
[5, 4, 3],
[5, 5, 4],
[0, 0, 0],
[0, 0, 0],
[5, 2, 1],
[5, 3, 2],
[5, 4, 3],
[5, 5, 4],
[0, 0, 0]]
There will be multiple patterns to find and replace, each with their own replacement array. The intent is to loop over the provided search rows and replacement patterns one at a time.
The search row is always a single row, repeated as many times as there are rows in the replacement pattern.
I assume that all your quantities are arrays, not lists. If that's not the case, wrap them in np.array
:
search = np.array([0, 1, 1])
pattern = np.array(pattern1)
The size of a block is given by
n = len(pattern) # or pattern.shape[0]
I assume that you want to replace only non-overlapping segments. So while six rows of search
make for exactly two instances of pattern
in the output, seven rows make for two instances of pattern
and an instance of search
.
Searching for a pattern is simple. Start by creating a mask of where rows match the pattern:
mask = (largeArr == search).all(1)
The idiom for finding contiguous runs of a mask has been beaten to death on this site. The gist is to use np.diff
to find where the sign of the mask changes, then np.flatnonzero
to get indices, and np.diff
again to compute run lengths. First pad the mask to ensure that the result includes endpoints correctly:
indices = np.flatnonzero(np.diff(np.r_[False, mask, False])).reshape(-1, 2)
runs = np.diff(indices, axis=1).squeeze()
Notice that indices
has been reshaped to two columns for convenience. The passing l padding guarantees that this is possible. The first column is the start of each run (inclusive), while the second is the end (exclusive). That makes computing the run lengths in runs
trivial.
Now you can adjust indices
to include only runs of size n
or longer, and trim the ending elements to be a multiple of n
away from the start elements:
# runs = n * (runs // n), but for huge arrays
np.floor_divide(runs, n, out=runs)
np.multiply(runs, n, out=runs)
indices[:, 1] = indices[:, 0] + runs
You could trim zero-length runs out of indices
using indices = indices[np.flatnonzero(runs)]
, but this is not necessary. The next step is to convert the adjusted indices
back into a mask:
mask = np.zeros_like(mask, dtype=np.int8)
The np.uint8
dtype allows you to store +1 and -1 in the mask, and is the same size as np.bool_
, which means that if done properly, the final result can be seamlessly viewed as a boolean mask:
starts, ends = indices.T
if ends[-1] == mask.size:
ends = ends[:-1]
mask[starts] = 1
mask[ends] -= 1 # This takes care of zero-length segments automatically
mask = np.cumsum(mask, out=mask).view(bool)
The extra processing for ends
, which is unpacked as the second column of indices
, takes care of the case where the mask runs to the end of the array. Since end indices are exclusive, this would be past the end of the array, but that also means that that run does not need to be terminated at all.
Now that your mask has been filtered and trimmed, you can assign to the correct rows in largeArr
. The simplest way is to repeat pattern
as many times as necessary:
largeArr[mask, :] = np.tile(pattern, [runs.sum() // n, 1])
If you package this into a function, you can run it for multiple patterns:
def replace_pattern(arr, search, pattern):
n = len(pattern)
mask = (arr == search).all(1)
indices = np.flatnonzero(np.diff(np.r_[False, mask, False])).reshape(-1, 2)
runs = np.diff(indices, axis=1).squeeze()
np.floor_divide(runs, n, out=runs)
np.multiply(runs, n, out=runs)
indices[:, 1] = indices[:, 0] + runs
mask = np.zeros_like(mask, dtype=np.int8)
starts, ends = indices.T
if ends[-1] == mask.size:
ends = ends[:-1]
mask[starts] = 1
mask[ends] -= 1
mask = np.cumsum(mask, out=mask).view(bool)
arr[mask, :] = np.tile(pattern, [runs.sum() // n, 1])
replacements = [
([0, 1, 1], [[0, 2, 1],
[0, 2, 2],
[0, 2, 3]]),
([3, 2, 0], [[5, 2, 1],
[5, 3, 2],
[5, 4, 3],
[5, 5, 4]])
]
largeArr = np.array(...)
for search, pattern in replacements:
replace_pattern(largeArr, search, pattern)