Tags: tensorflow, math, keras, multidimensional-array, image-segmentation

Why should the input width and height be divisible by 32 for U-Net segmentation models?


I am using the Qubvel segmentation models (https://github.com/qubvel/segmentation_models) with the Keras backend to train on a medical binary segmentation problem. The models train fine with input images and masks of spatial dimensions 256 x 224, 256 x 256, 512 x 480, 512 x 512, and other values, as long as the width and height are divisible by 32. Otherwise, the models do not train. What is the mathematical reason behind this rule that the input width and height must be divisible by 32?


Solution

  • The U-Net architecture (from the original paper, shown below) works by downsampling the input through the encoder layers and upsampling it back through the decoder layers. It has 5 such stages, each halving the spatial resolution on the way down and doubling it on the way up. Because pooling uses integer (floor) division, a dimension that is not a multiple of 32 (2^5) loses pixels on the way down and cannot be restored exactly on the way up. Choosing a multiple of 32 ensures the downsampling and upsampling process yields the same resolution for input and output, so that the loss can be computed at pixel level (see the first sketch after the figure).

    Having said that, if you want to make it work for a different input size, you just need to make sure that the decoder (by padding or other means) returns the same spatial size as the corresponding encoder output at each intermediate level (for the skip connections), as well as matching the size of the output image (see the padding sketch below).

    [Figure: U-Net architecture diagram from the original paper]
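
To make the arithmetic concrete, here is a minimal sketch (plain Python, independent of Keras or the qubvel library) of the round trip a single spatial dimension takes through 5 pooling and 5 upsampling stages. The size mismatch for values not divisible by 32 is exactly what breaks the skip connections and the pixel-level loss:

```python
def round_trip(size, stages=5):
    """Downsample `size` through `stages` poolings, then upsample back."""
    for _ in range(stages):
        size = size // 2   # 2x2 max pooling floors odd sizes
    for _ in range(stages):
        size = size * 2    # 2x upsampling in the decoder
    return size

for width in (256, 224, 250, 300):
    print(width, "->", round_trip(width))
# 256 -> 256   (256 = 8 * 32, shapes match)
# 224 -> 224   (224 = 7 * 32, shapes match)
# 250 -> 224   (mismatch: skip connections and loss fail)
# 300 -> 288   (mismatch)
```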
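
If you do need arbitrary input sizes without modifying the model, a common workaround is to pad the input up to the next multiple of 32 before inference and crop the prediction back afterwards. The helpers below are a hypothetical NumPy sketch, not part of the qubvel library:

```python
import numpy as np

def pad_to_multiple(image, multiple=32):
    """Zero-pad H and W up to the next multiple of `multiple`."""
    h, w = image.shape[:2]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    return padded, (h, w)

def crop_to_original(mask, original_size):
    """Crop a padded prediction back to the original H x W."""
    h, w = original_size
    return mask[:h, :w]

image = np.zeros((250, 300, 3), dtype=np.float32)
padded, orig = pad_to_multiple(image)
print(padded.shape)  # (256, 320, 3): both dims now divisible by 32
```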