I am trying to understand why the implementation of ResNet50 in Keras forbids images smaller than 32x32x3.
Based on their implementation: https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
The function that catches that is _obtain_input_shape
To overcome this problem, I made my own implementation based on their code and I removed the code that forbids minimal size. In my implementation I also add the possibility to work with pre-trained model with more than three channels by replicating the RGB weights for the first conv1 layer.
def ResNet50(load_weights=True,
input_shape=None,
pooling=None,
classes=1000):
img_input = Input(shape=input_shape, name='tuned_input')
x = ZeroPadding2D(padding=(3, 3), name='conv1_pad')(img_input)
# Stage 1 (conv1_x)
x = Conv2D(64, (7, 7),
strides=(2, 2),
padding='valid',
kernel_initializer=KERNEL_INIT,
name='tuned_conv1')(x)
x = BatchNormalization(axis=CHANNEL_AXIS, name='bn_conv1')(x)
x = Activation('relu')(x)
x = ZeroPadding2D(padding=(1, 1), name='pool1_pad')(x)
x = MaxPooling2D((3, 3), strides=(2, 2))(x)
# Stage 2 (conv2_x)
x = _convolution_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
for block in ['b', 'c']:
x = _identity_block(x, 3, [64, 64, 256], stage=2, block=block)
# Stage 3 (conv3_x)
x = _convolution_block(x, 3, [128, 128, 512], stage=3, block='a')
for block in ['b', 'c', 'd']:
x = _identity_block(x, 3, [128, 128, 512], stage=3, block=block)
# Stage 4 (conv4_x)
x = _convolution_block(x, 3, [256, 256, 1024], stage=4, block='a')
for block in ['b', 'c', 'd', 'e', 'f']:
x = _identity_block(x, 3, [256, 256, 1024], stage=4, block=block)
# Stage 5 (conv5_x)
x = _convolution_block(x, 3, [512, 512, 2048], stage=5, block='a')
for block in ['b', 'c']:
x = _identity_block(x, 3, [512, 512, 2048], stage=5, block=block)
# Condition on the last layer
if pooling == 'avg':
x = layers.GlobalAveragePooling2D()(x)
elif pooling == 'max':
x = layers.GlobalMaxPooling2D()(x)
inputs = img_input
# Create model.
model = models.Model(inputs, x, name='resnet50')
if load_weights:
weights_path = get_file(
'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5',
WEIGHTS_PATH_NO_TOP,
cache_subdir='models',
md5_hash='a268eb855778b3df3c7506639542a6af')
model.load_weights(weights_path, by_name=True)
f = h5py.File(weights_path, 'r')
d = f['conv1']
# Used to work with more than 3 channels with pre-trained model
if input_shape[2] % 3 == 0:
model.get_layer('tuned_conv1').set_weights([d['conv1_W_1:0'][:].repeat(input_shape[2] / 3, axis=2),
d['conv1_b_1:0']])
else:
m = (3 * int(input_shape[2] / 3)) + 3
model.get_layer('tuned_conv1').set_weights(
[d['conv1_W_1:0'][:].repeat(m, axis=2)[:, :, 0:input_shape[2], :],
d['conv1_b_1:0']])
return model
I run my implementation with a 10x10x3 images and it seems to work. Thus I do not understand why they put this minimal bound.
They do not provide any information about this choice. I also check the original paper and I did not found any restriction mentioned about a minimal input shape. I suppose there is a reason for this bound but I do not know this one.
Thus I would like to know why such restriction has been done for the Resnet implementation.
ResNet50 has 5 stages of downsampling, between MaxPooling of 2x2 and Strided Convolution with strides of 2 px in each direction. This means that the minimum input size is 2^5 = 32, and this value is also the size of the receptive field.
There is not much point of using smaller images than 32x32, since then downsampling is not doing anything, and this will change the behavior of the network. For such small images then its better to use another network with less downsampling (like DenseNet) or with less depth.