python-2.7, keras, conv-neural-network, dimension

How to work with variable-sized images in CNNs using Keras?


I am currently working on a CNN for image feature extraction using Keras. All the images have 276 rows, a variable number of columns, and 3 color channels (RGB). The number of columns equals the length of the output feature vector the network should generate.

Input data representation - edit:

The input data given to the network consists of column-wise slices of the image, which means each slice has shape (276, 3), and the number of columns equals the length of the feature vector the network should generate.

My initial model is as follows:

    print "Model Definition"
    model = Sequential()

    model.add(Convolution2D(64,row,1,input_shape=(row,None,3)))
    print model.output_shape
    model.add(MaxPooling2D(pool_size=(1,64)))
    print model.output_shape
    model.add(Dense(1,activation='relu'))

The print statements in between print `model.output_shape`, and I am a bit confused by the output.

Model Definition
(None, 1, None, 64)
(None, 1, None, 64)

How come the 3D data becomes 4D? And why does it stay 4D after the MaxPooling2D layer?

My dense (fully-connected) layer is giving me problems with the dimensions here:

Traceback (most recent call last):
  File "keras_convolutional_feature_extraction.py", line 466, in <module>
    model(0,train_input_data,output_data_train,test_input_data,output_data_test)
  File "keras_convolutional_feature_extraction.py", line 440, in model
    model.add(Dense(1,activation='relu'))
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 324, in add
    output_tensor = layer(self.outputs[0])
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 474, in __call__
    self.assert_input_compatibility(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 415, in assert_input_compatibility
    str(K.ndim(x)))
Exception: Input 0 is incompatible with layer dense_1: expected ndim=2, found ndim=4

So why am I not able to get the data down to a single value from a 3D image?


Solution

  • You are operating on a 276 x None x 3 image using 64 convolutional filters, each of size 276 x 1 (assuming rows = 276). One convolutional filter will output a matrix of size 1 x None. Read this in detail if you do not know how convolutional filters work. So, for 64 filters, the Theano backend will give you a matrix of size 64 x 1 x None; with the TensorFlow backend, I think it will be 1 x None x 64. Now, the first dimension in Keras is always the samples dimension. So your final output shape will be None x 64 x 1 x None with Theano and None x 1 x None x 64 with TensorFlow. Read this for more information on the different backends in Keras.
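
    To make that 4D shape concrete, here is a minimal sketch of my own (not from the question) that builds just the first convolutional layer with a fixed width of 100 columns and the TensorFlow dimension ordering, so that every entry of the shape is defined:

    from keras.models import Sequential
    from keras.layers import Convolution2D

    row = 276
    model = Sequential()
    # dim_ordering='tf' means shapes are (samples, height, width, channels)
    model.add(Convolution2D(64, row, 1, dim_ordering='tf',
                            input_shape=(row, 100, 3)))
    # Keras always prepends the samples dimension, so the "3D" image
    # (276 x 100 x 3) comes out of the layer as a 4D tensor:
    print model.output_shape  # (None, 1, 100, 64)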

    To fix the dense layer error, I think you will need to flatten the output by introducing the following line before adding the Dense layer.

    model.add(Flatten())
    

    However, I do not really understand the use of a dense layer here. As you must be aware, a dense layer only accepts a fixed input size and produces a fixed-size output. So your None dimension would basically be restricted to a single value if you want your network to run without throwing errors. If you want an output of shape 1 x None, then you should not include dense layers; instead, use average pooling at the end to collapse the response to a 1 x 1 x None output.

    Edit: If you have an image of size 276 x n x 3, where n is a variable number of columns, and you want an output of size 1 x n, then you can do the following:

    model = Sequential()
    model.add(Convolution2D(64,row,1,input_shape=(row,None,3)))
    model.add(Convolution2D(1,1,1))
    print model.output_shape  # this should print `None x 1 x None x 1`
    model.add(Flatten())
    

    Now, I doubt this network will perform very well, since it has only one layer of 64 filters. The receptive field is also too large (276, i.e. the full height of the image). You can do two things:

    1. Reduce the receptive field, i.e. instead of convolving the entire column of the image at once, you can convolve only 3 pixels of a column at a time.
    2. Have multiple convolutional layers.

    In the following, I will assume that the image height is 50. Then you can write a network as follows:

    model = Sequential()
    model.add(Convolution2D(32,3,1,activation='relu',
              init='he_normal',input_shape=(row,None,3)))  # row = 50
    model.add(Convolution2D(32,3,1,activation='relu',init='he_normal'))
    model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1), name='pool1'))
    model.add(Convolution2D(64,3,1,activation='relu',init='he_normal'))
    model.add(Convolution2D(64,3,1,activation='relu',init='he_normal'))
    model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1), name='pool2'))
    model.add(Convolution2D(128,3,1,activation='relu',init='he_normal'))
    model.add(Convolution2D(128,3,1,activation='relu',init='he_normal'))
    model.add(Convolution2D(128,3,1,activation='relu',init='he_normal'))
    model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1), name='pool3'))
    model.add(Convolution2D(1,1,1, name='squash_channels'))
    print model.output_shape  # this should print `None x 1 x None x 1`
    model.add(Flatten(name='flatten_input'))
    

    You should verify that all these convolutional and max-pooling layers reduce the input height from 50 to 1 after the last max-pooling layer.
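
    As a quick check, you can print every layer's output shape (a sketch of my own; it assumes the model above has been built with a concrete width, e.g. input_shape=(row, 100, 3), so that all shapes, including the one Flatten needs, are fully defined):

    # print (samples, height, width, channels) after every layer
    for layer in model.layers:
        print layer.name, layer.output_shape
    # with `tf` ordering the heights should read
    # 48, 46, 23, 21, 19, 9, 7, 5, 3, 1, 1 from the first conv down to squash_channels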

    How to handle variable-sized images

    One way is to first determine a common size for your dataset, e.g. 224. Then construct the network for a 224 x n image as shown above (maybe a little deeper). Now let us say you get an image of a different size, say p x n', where p > 224 and n' != n. You can take a center crop of size 224 x n' and pass it through the network. You have your feature vector.

    If you think that the majority of the information is not concentrated around the center, then you can take multiple crops and average (or max-pool) the feature vectors obtained. Using these methods, I think you should be able to handle variable-sized inputs.
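
    Here is a rough numpy sketch of the cropping idea (my own illustration; center_crop, feature_vector and the 224-row crop height are assumptions taken from the text, and model is assumed to be a trained, fully convolutional Keras network that accepts 224 x n x 3 inputs):

    import numpy as np

    def center_crop(img, crop_height=224):
        # img has shape (height, width, 3); keep crop_height rows around the center
        top = (img.shape[0] - crop_height) // 2
        return img[top:top + crop_height, :, :]

    def feature_vector(model, img):
        # add the samples dimension and run the network on the center crop
        crop = center_crop(img)
        return model.predict(crop[np.newaxis, ...])[0]

    def averaged_feature_vector(model, img, crop_height=224):
        # average the feature vectors of a top, a center and a bottom crop
        tops = [0, (img.shape[0] - crop_height) // 2, img.shape[0] - crop_height]
        crops = np.array([img[t:t + crop_height, :, :] for t in tops])
        return model.predict(crops).mean(axis=0)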

    Edit:

    See the CNN that I defined above using 3 x 1 convolutions. Assume that the input is of size 50 x n x 3. Say we pass an input of size p x q x r through a convolutional layer which has f filters, each of size 3 x 1, with stride 1 and no padding. Then the output of the convolutional layer will be of size (p-2) x q x f, i.e. the output height is two less than that of the input while the width is unchanged. Our pooling layers are of size (2,1) with stride (2,1), so they halve the input in the y-direction (i.e. halve the image height). Keeping this in mind, the following is straightforward to derive (note the layer names I gave in my CNN; they are referenced below).

    CNN input: None x 50 x n x 3

    Input of pool1 layer: None x 46 x n x 32
    Output of pool1 layer: None x 23 x n x 32

    Input of pool2 layer: None x 19 x n x 64
    Output of pool2 layer: None x 9 x n x 64 (I think Keras pooling takes floor i.e. floor(19/2) = 9)

    Input of pool3 layer: None x 3 x n x 128
    Output of pool3 layer: None x 1 x n x 128

    Input of squash_channels layer: None x 1 x n x 128
    Output of squash_channels layer: None x 1 x n x 1

    Input of flatten_input layer: None x 1 x n x 1
    Output of flatten_input layer: None x n
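
    A tiny sanity check of this arithmetic in plain Python (my own sketch, no Keras needed):

    h = 50
    for layer in ['conv', 'conv', 'pool1', 'conv', 'conv', 'pool2',
                  'conv', 'conv', 'conv', 'pool3']:
        # a (2,1) pooling halves the height, a 3x1 valid convolution subtracts 2
        h = h // 2 if layer.startswith('pool') else h - 2
        print layer, h
    # the last line printed is: pool3 1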

    I think this is what you wanted. I hope it's clear now.