python, keras, deep-learning

Trouble with layer shapes in Keras


I am building an autoencoder with Keras for denoising purposes and I have an issue with the shapes in the model.

Here is the model:

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(129, 87, 1))

# Encoder
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)

# Decoder
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

# Autoencoder
autoencoder = Model(inputs, x)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.summary()

The input images have a shape of 129x87, but in the model summary I get:

Model: "model_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_23 (InputLayer)       [(None, 129, 87, 1)]      0         
                                                                 
 conv2d_88 (Conv2D)          (None, 129, 87, 32)       320       
                                                                 
 max_pooling2d_38 (MaxPoolin  (None, 65, 44, 32)       0         
 g2D)                                                            
                                                                 
 conv2d_89 (Conv2D)          (None, 65, 44, 32)        9248      
                                                                 
 max_pooling2d_39 (MaxPoolin  (None, 33, 22, 32)       0         
 g2D)                                                            
                                                                 
 conv2d_transpose_12 (Conv2D  (None, 66, 44, 32)       9248      
 Transpose)                                                      
                                                                 
 conv2d_transpose_13 (Conv2D  (None, 132, 88, 32)      9248      
 Transpose)                                                      
                                                                 
 conv2d_90 (Conv2D)          (None, 132, 88, 1)        289       
                                                                 
=================================================================
Total params: 28,353
Trainable params: 28,353
Non-trainable params: 0
_________________________________________________________________

We can see that the last layer has a shape of (132, 88) and not (129, 87). What am I missing?

Solution

  • The size mismatch is because your input image sizes are not divisible by 4.

    Why 4? Because you have two MaxPooling2D layers with 2x2 windows, and each of them divides the image height and width by 2 (due to the window size): a 2x reduction per layer across 2 layers gives 2^2 = 4. If there were a third such layer, you'd want your image sizes to be divisible by 2^3 = 8. As another example, if your two MaxPooling2D layers used 3x3 windows, you'd want the image sizes to be divisible by 3^2 = 9.

    When an image size is not divisible by the window size, the size after the MaxPooling2D layer is rounded up so that the pixels at the edges are still included even though they don't fill a whole window. Notice in your example that ceil(129 / 2) = 65. Conv2DTranspose, however, scales the image size exactly as a multiple of the strides, so whenever the downsampling did not divide evenly, the upsampled size no longer matches the original. The shape walkthrough at the end of this answer traces the exact numbers.

    Solution: Trim or resize your input images to sizes that are divisible by 4 in this case (see the resizing snippet at the end of this answer).

    This divisibility requirement is the main reason why most papers use image sizes that are powers of 2 (e.g., 32, 64, 128, ...). That way you can stack as many MaxPooling layers with 2x2 windows as you like without running into this issue.

    This is beyond your original question, but your architecture is not a good autoencoder. Since you maintain the "spatial" structure at the end of the encoder, that representation still retains a lot of redundancy and uses more dimensions than necessary. You should flatten the output of the second MaxPooling2D and apply a Dense layer that compresses it into your final latent-space embedding, then reverse the operations on the decoder side. A rough sketch is included at the end of this answer.
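
    To make the rounding concrete, here is a small framework-free walkthrough (plain Python) that reproduces the shapes reported in your model summary:

import math

h, w = 129, 87

# Encoder: each MaxPooling2D with padding="same" rounds up
h, w = math.ceil(h / 2), math.ceil(w / 2)   # (65, 44)
h, w = math.ceil(h / 2), math.ceil(w / 2)   # (33, 22)

# Decoder: each Conv2DTranspose with strides=2 multiplies exactly
h, w = h * 2, w * 2                         # (66, 44)
h, w = h * 2, w * 2                         # (132, 88)

print((h, w))  # (132, 88) -- does not match the original (129, 87)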
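
    One way to apply the fix, assuming your data is a batch of 129x87x1 arrays held in variables named x_clean and x_noisy (placeholder names), is to crop or pad every image to a nearby size divisible by 4, e.g. 128x88, before feeding it to the model:

import tensorflow as tf

# 128 and 88 are both divisible by 4; this target size is one choice, not the only one
x_clean = tf.image.resize_with_crop_or_pad(x_clean, target_height=128, target_width=88)
x_noisy = tf.image.resize_with_crop_or_pad(x_noisy, target_height=128, target_width=88)

inputs = layers.Input(shape=(128, 88, 1))  # the rest of the model stays unchanged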
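
    Finally, a rough sketch of the flatten-plus-Dense bottleneck described above. It assumes the 128x88 input from the previous snippet, and the latent size of 128 is an arbitrary choice:

inputs = layers.Input(shape=(128, 88, 1))

# Encoder: same convolution/pooling stack as before
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)            # (32, 22, 32)

# Bottleneck: flatten and compress into a dense latent vector
x = layers.Flatten()(x)                                       # 32 * 22 * 32 = 22528
latent = layers.Dense(128, activation="relu")(x)

# Decoder: reverse the operations
x = layers.Dense(32 * 22 * 32, activation="relu")(latent)
x = layers.Reshape((32, 22, 32))(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
outputs = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")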