Search code examples
pythonnumpymultidimensional-arrayvectorizationmasking

Replacing sections of 2D array with a smaller 2D array using masks


How can I replace multiple instances of a pattern in a large 2D numpy array with a smaller 2D numpy array?

I am looking for a vectorized solution using a boolean mask to minimize the performance impact because the large array I'm working will be millions of rows long.

For example:

#Large array
largeArr = np.array([
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [3, 2, 0],
    [3, 2, 0],
    [3, 2, 0],
    [3, 2, 0],
    [0, 0, 0],
    [0, 0, 0],
    [3, 2, 0],
    [3, 2, 0],
    [3, 2, 0],
    [3, 2, 0],
    [0, 0, 0]
])

I would like to replace sections with 3 consecutive rows containing [0, 1, 1] with

pattern1 = [
    [0, 2, 1],
    [0, 2, 2],
    [0, 2, 3]
]

Then I would like to replace sections with 4 consecutive rows containing [3, 2, 0] with

pattern2 = [
    [5, 2, 1],
    [5, 3, 2],
    [5, 4, 3],
    [5, 5, 4]
]

Expected result:

[[0, 2, 1],
 [0, 2, 2],
 [0, 2, 3],
 [0, 0, 0],
 [0, 0, 0],
 [0, 0, 0],
 [0, 2, 1],
 [0, 2, 2],
 [0, 2, 3],
 [0, 0, 0],
 [0, 0, 0],
 [5, 2, 1],
 [5, 3, 2],
 [5, 4, 3],
 [5, 5, 4],
 [0, 0, 0],
 [0, 0, 0],
 [5, 2, 1],
 [5, 3, 2],
 [5, 4, 3],
 [5, 5, 4],
 [0, 0, 0]]

There will be multiple patterns to find and replace, each with their own replacement array. The intent is to loop over the provided search rows and replacement patterns one at a time.

The search row is always a single row, repeated as many times as there are rows in the replacement pattern.


Solution

  • I assume that all your quantities are arrays, not lists. If that's not the case, wrap them in np.array:

    search = np.array([0, 1, 1])
    pattern = np.array(pattern1)
    

    The size of a block is given by

    n = len(pattern)  # or pattern.shape[0]
    

    I assume that you want to replace only non-overlapping segments. So while six rows of search make for exactly two instances of pattern in the output, seven rows make for two instances of pattern and an instance of search.

    Searching for a pattern is simple. Start by creating a mask of where rows match the pattern:

    mask = (largeArr == search).all(1)
    

    The idiom for finding contiguous runs of a mask has been beaten to death on this site. The gist is to use np.diff to find where the sign of the mask changes, then np.flatnonzero to get indices, and np.diff again to compute run lengths. First pad the mask to ensure that the result includes endpoints correctly:

    indices = np.flatnonzero(np.diff(np.r_[False, mask, False])).reshape(-1, 2)
    runs = np.diff(indices, axis=1).squeeze()
    

    Notice that indices has been reshaped to two columns for convenience. The passing l padding guarantees that this is possible. The first column is the start of each run (inclusive), while the second is the end (exclusive). That makes computing the run lengths in runs trivial.

    Now you can adjust indices to include only runs of size n or longer, and trim the ending elements to be a multiple of n away from the start elements:

    # runs = n * (runs // n), but for huge arrays
    np.floor_divide(runs, n, out=runs)
    np.multiply(runs, n, out=runs)
    
    indices[:, 1] = indices[:, 0] + runs
    

    You could trim zero-length runs out of indices using indices = indices[np.flatnonzero(runs)], but this is not necessary. The next step is to convert the adjusted indices back into a mask:

    mask = np.zeros_like(mask, dtype=np.int8)
    

    The np.uint8 dtype allows you to store +1 and -1 in the mask, and is the same size as np.bool_, which means that if done properly, the final result can be seamlessly viewed as a boolean mask:

    starts, ends = indices.T
    if ends[-1] == mask.size:
        ends = ends[:-1]
    mask[starts] = 1
    mask[ends] -= 1  # This takes care of zero-length segments automatically
    mask = np.cumsum(mask, out=mask).view(bool)
    

    The extra processing for ends, which is unpacked as the second column of indices, takes care of the case where the mask runs to the end of the array. Since end indices are exclusive, this would be past the end of the array, but that also means that that run does not need to be terminated at all.

    Now that your mask has been filtered and trimmed, you can assign to the correct rows in largeArr. The simplest way is to repeat pattern as many times as necessary:

    largeArr[mask, :] = np.tile(pattern, [runs.sum() // n, 1])
    

    If you package this into a function, you can run it for multiple patterns:

    def replace_pattern(arr, search, pattern):
        n = len(pattern)
        mask = (arr == search).all(1)
        indices = np.flatnonzero(np.diff(np.r_[False, mask, False])).reshape(-1, 2)
        runs = np.diff(indices, axis=1).squeeze()
        np.floor_divide(runs, n, out=runs)
        np.multiply(runs, n, out=runs)
        indices[:, 1] = indices[:, 0] + runs
        mask = np.zeros_like(mask, dtype=np.int8)
        starts, ends = indices.T
        if ends[-1] == mask.size:
            ends = ends[:-1]
        mask[starts] = 1
        mask[ends] -= 1
        mask = np.cumsum(mask, out=mask).view(bool)
        arr[mask, :] = np.tile(pattern, [runs.sum() // n, 1])
    
    replacements = [
        ([0, 1, 1], [[0, 2, 1],
                     [0, 2, 2],
                     [0, 2, 3]]),
        ([3, 2, 0], [[5, 2, 1],
                     [5, 3, 2],
                     [5, 4, 3],
                     [5, 5, 4]])
    ]
    
    largeArr = np.array(...)
    
    for search, pattern in replacements:
        replace_pattern(largeArr, search, pattern)