Suppose I have a 300 x 300 input image with 1 channel, contained in a numpy array with shape (300, 300, 1). The channel is a single bit: either 0 or 1.
How can I divide it into a 4 x 4 grid of tiles, each tile 75 by 75 pixels, and stack the tiles together by summing up the bits?
In the end I'd have a single numpy array of shape (75, 75, 1), whose values can range from 0 to 16.
How well would this work as an input to a convolutional neural network? Is this an effective way of shrinking my input?
You can do it using numpy.lib.stride_tricks.as_strided (https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.as_strided.html):

import numpy as np

# mock a single-channel binary image
SIZE = 300
X = np.random.randint(low=0, high=2, size=(SIZE, SIZE, 1))

BLOCK_SIZE = 4               # the image is a 4 x 4 grid of tiles
TILE = SIZE // BLOCK_SIZE    # each tile is 75 x 75
assert SIZE % BLOCK_SIZE == 0

# define the view by its strides: the first two axes walk the grid of
# tiles, the next two walk the pixels inside a tile
new_strides = [X.strides[0] * TILE, X.strides[1] * TILE,
               X.strides[0], X.strides[1], X.strides[2]]
tiles = np.lib.stride_tricks.as_strided(
    X, shape=(BLOCK_SIZE, BLOCK_SIZE, TILE, TILE, 1), strides=new_strides)
coarsened_X = tiles.sum(axis=(0, 1))   # shape (75, 75, 1), values in 0..16
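As a sanity check, here is a self-contained sketch comparing a strided view of the sixteen tiles against a naive double loop that slices each tile out and accumulates it (the loop version and the seeded RNG are my additions for illustration):

```python
import numpy as np

SIZE, BLOCK_SIZE = 300, 4
TILE = SIZE // BLOCK_SIZE  # 75

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(SIZE, SIZE, 1))

# Naive version: slice out each of the 16 tiles and accumulate.
naive = np.zeros((TILE, TILE, 1), dtype=X.dtype)
for a in range(BLOCK_SIZE):
    for b in range(BLOCK_SIZE):
        naive += X[a * TILE:(a + 1) * TILE, b * TILE:(b + 1) * TILE, :]

# Strided version: axes 0/1 walk the grid, axes 2/3 walk inside a tile.
strides = [X.strides[0] * TILE, X.strides[1] * TILE,
           X.strides[0], X.strides[1], X.strides[2]]
view = np.lib.stride_tricks.as_strided(
    X, shape=(BLOCK_SIZE, BLOCK_SIZE, TILE, TILE, 1), strides=strides)
strided = view.sum(axis=(0, 1))

assert np.array_equal(naive, strided)
assert strided.shape == (75, 75, 1)
assert strided.min() >= 0 and strided.max() <= 16
```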
About your final question: yes, I do think it's a reasonable way of shrinking the input of your CNN. Depending on how much training data you have available, it can be more efficient than a first trainable convolutional layer on a very large image. Note that it's preferable to average your inputs instead of summing them, so the values stay in [0, 1], which is better conditioned for optimization.
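For instance, a minimal sketch of the averaged variant (self-contained; the reshape-based sum and the 1/16 rescale are my illustration, with 16 being the number of stacked tiles):

```python
import numpy as np

SIZE, BLOCK_SIZE = 300, 4
TILE = SIZE // BLOCK_SIZE
X = np.random.randint(0, 2, size=(SIZE, SIZE, 1))

# Sum the 16 tiles, then rescale so values lie in [0, 1] instead of [0, 16].
summed = X.reshape(BLOCK_SIZE, TILE, BLOCK_SIZE, TILE, 1).sum(axis=(0, 2))
averaged = summed / BLOCK_SIZE ** 2   # float array in [0, 1]

assert averaged.shape == (75, 75, 1)
assert 0.0 <= averaged.min() and averaged.max() <= 1.0
```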
A benchmark showed it to be roughly 50x faster for an array of your size.
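The benchmark code itself isn't shown; a minimal `timeit` sketch of how one might time it, comparing the strided sum against a plain Python loop over the 16 tiles (not the original benchmark; numbers will vary by machine):

```python
import timeit
import numpy as np

SIZE, BLOCK_SIZE = 300, 4
TILE = SIZE // BLOCK_SIZE
X = np.random.randint(0, 2, size=(SIZE, SIZE, 1))

def strided_sum():
    # View the 4 x 4 grid of 75 x 75 tiles, then reduce over the grid axes.
    s = [X.strides[0] * TILE, X.strides[1] * TILE,
         X.strides[0], X.strides[1], X.strides[2]]
    view = np.lib.stride_tricks.as_strided(
        X, shape=(BLOCK_SIZE, BLOCK_SIZE, TILE, TILE, 1), strides=s)
    return view.sum(axis=(0, 1))

def loop_sum():
    # Slice each tile out and accumulate it.
    out = np.zeros((TILE, TILE, 1), dtype=X.dtype)
    for a in range(BLOCK_SIZE):
        for b in range(BLOCK_SIZE):
            out += X[a * TILE:(a + 1) * TILE, b * TILE:(b + 1) * TILE, :]
    return out

print('strided:', timeit.timeit(strided_sum, number=100))
print('loop   :', timeit.timeit(loop_sum, number=100))
```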
Edit:

I just tried a version that uses only reshape, possibly less readable and/or generalizable:

X.reshape(BLOCK_SIZE, SIZE // BLOCK_SIZE, BLOCK_SIZE, SIZE // BLOCK_SIZE, 1).sum(axis=(0, 2))

I am suggesting it since it seems to be a few percent faster than the stride approach, in case that's of interest.
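A quick self-contained check that the reshape one-liner and the strided view agree on the tile-stacking operation (a sketch; both computations are redefined locally):

```python
import numpy as np

SIZE, BLOCK_SIZE = 300, 4
TILE = SIZE // BLOCK_SIZE
X = np.random.randint(0, 2, size=(SIZE, SIZE, 1))

# Reshape route: C-order reshape splits each spatial axis into (grid, tile).
via_reshape = X.reshape(BLOCK_SIZE, TILE, BLOCK_SIZE, TILE, 1).sum(axis=(0, 2))

# Stride route: axes 0/1 walk the grid, axes 2/3 walk inside a tile.
s = [X.strides[0] * TILE, X.strides[1] * TILE,
     X.strides[0], X.strides[1], X.strides[2]]
via_strides = np.lib.stride_tricks.as_strided(
    X, shape=(BLOCK_SIZE, BLOCK_SIZE, TILE, TILE, 1), strides=s).sum(axis=(0, 1))

assert np.array_equal(via_reshape, via_strides)
```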