A quick way to find the first matching submatrix from the matrix

My matrix is simple, like:

# python3 numpy
>>> A
array([[0., 0., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
>>> P
array([[0., 0., 0., 0.]])

I need to find an all-zero region(one is enough) in A with the same size as P (1x4). So the right answer include:

(2, 0)  # The vertex coordinates of the all-zero rectangular region that P can be matched
(2, 1)
(3, 0)
(3, 1)
(4, 0)
(4, 1)
# Just get any 1 answer

Actually my A matrix will reach a size of 30,000*30,000. I'm worried that it will be slow if written as a loop statement. Is there any quick way?

The size of P is uncertain, from 10*30 to 4000*80. At the same time, the A matrix lacks regularity, and looping from any point may require traversing the entire matrix to successfully match

Solution

As @Julien pointed out in the comment, in general, we can use sliding windows for this kind of task.

def find_all_zero_region_by_sliding_window(a, shape):
    x, y = np.nonzero(np.lib.stride_tricks.sliding_window_view(a, shape).max(axis=-1).max(axis=-1) == 0)
    return np.stack((x, y), axis=-1)


find_all_zero_region_by_sliding_window(A, P.shape)

However, unfortunately, this requires a lot of memory.

numpy.core._exceptions.MemoryError: Unable to allocate 11.3 TiB for an array with shape (26001, 29921, 4000) and data type float32
                                                       ^^^^^^^^

As an alternative, I think using the Summed-area table is a good idea.

It is similar to the sliding window approach above, but instead of finding the maximum value, we can calculate the sum (very efficiently) and search for the position where it is zero. Note that this assumes that A does not contain any negative values. Otherwise, you would have to use numpy.abs.

Since we do not need to be able to calculate the sum of any given position, I adapted this idea and implemented it to require only a single-line cache.

import numpy as np
from typing import Tuple


def find_all_zero_region(arr: np.ndarray, kernel_size: Tuple[int, int]) -> np.ndarray:
    input_height, input_width = arr.shape
    kernel_height, kernel_width = kernel_size

    matches = []

    # Calculate summed_line for y==0.
    summed_line = arr[:kernel_height].sum(axis=0)

    for y in range(input_height - kernel_height + 1):
        # Update summed_line for row y.
        if y != 0:  # Except y==0, which already calculated above.
            # Adding new row and subtracting old row.
            summed_line += arr[y + kernel_height - 1] - arr[y - 1]

        # Calculate kernel_sum for (y, 0).
        kernel_sum = summed_line[:kernel_width].sum()
        if kernel_sum == 0:
            matches.append((y, 0))

        # Calculate kernel_sum for (y, 1) to (y, right-edge).
        # Using the idea of a summed-area table, but in 1D (horizontally).
        (all_zero_region_cols,) = np.nonzero(kernel_sum + np.cumsum(summed_line[kernel_width:] - summed_line[:-kernel_width]) == 0)
        for col in all_zero_region_cols:
            matches.append((y, col + 1))

    if not matches:
        # For Numba, output must be a 2d array.
        return np.zeros((0, 2), dtype=np.int64)
    return np.array(matches, dtype=np.int64)

As you can see, this uses loops, but it should be much faster than you think because the required memory is relatively small and the number of calculations/comparisons is greatly reduced. Here is some timing code.

import time


rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(30000, 30000)).astype(np.float32)

P = np.zeros(shape=(4000, 80))

# Create an all-zero region in the bottom right corner which will be searched last.
A[-P.shape[0] :, -P.shape[1] :] = 0

started = time.perf_counter()
result = find_all_zero_region(A, P.shape)
print(f"{time.perf_counter() - started} sec")
print(result)
# 3.541154200000001 sec
# [[26000 29920]]

Moreover, this function can be even faster by using Numba. Just add annotations as follows:

import numba


@numba.njit("int64[:,:](float32[:,:],UniTuple(int64,2))")
def find_all_zero_region_with_numba(arr: np.ndarray, kernel_size: Tuple[int, int]) -> np.ndarray:
    ...

started = time.perf_counter()
find_all_zero_region_with_numba(A, P.shape)
print(f"{time.perf_counter() - started} sec")
# 1.6005743999999993 sec

Note that I implemented it to find all positions of the all-zero regions, but you can also make it return on the first one. Since it uses loops, the average execution time will be even faster.