
Image sequence processing ConvLSTM vs LSTM architecture in Keras


I need to train a sequence-based segmentation model on 10x10 images. Below are the LSTM and ConvLSTM models I want to use:

def lstmModel():
    # Model definition
    model = Sequential()
    model.add(LSTM(50, batch_input_shape=(1, None, inp.shape[1]*inp.shape[2]), return_sequences=True, stateful=True))
    model.add(Dense(out.shape[1]*out.shape[2], activation='softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model


def convlstmModel():
    # Model definition
    model = Sequential()
    model.add(ConvLSTM2D(12, kernel_size=5, padding="same", batch_input_shape=(1, None, inp.shape[1], inp.shape[2], 1), return_sequences=True, stateful=True))
    model.add(Conv2D(20, 3, padding='same', activation='softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

I train the models on sequences of 10 random 10x10 images. The LSTM model works fine for me, but the ConvLSTM model throws a dimension mismatch at the Conv2D layer:

ValueError: Input 0 is incompatible with layer conv2d_1: expected ndim=4, found ndim=5

Any help is really appreciated. Thanks!


Solution

  • LSTM layers are meant for "time sequences".
    Conv layers are meant for "still images".

    One requires shapes like (batch, steps, features).
    The other requires: (batch, width, height, features)

    Now, ConvLSTM2D mixes both and requires (batch, steps, width, height, features)

    The output of ConvLSTM2D keeps that extra steps dimension, which is not supported by Conv2D.

    If you want to keep this dimension, use the convolution with a TimeDistributed wrapper:

    model.add(TimeDistributed(Conv2D(...)))
    

    Notice that you will still have all 5 dimensions, in contrast with your other model, which has only 3.

    You will need some kind of reshape or other operation to make the output match your training data.

    Since your question doesn't show anything about it, that's all we can answer for now.
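To make the fix concrete, here is a sketch of the ConvLSTM model with the Conv2D wrapped in TimeDistributed. It assumes the 10x10 single-channel inputs and the filter counts from the question (the `height`/`width` parameters are stand-ins for `inp.shape[1]`/`inp.shape[2]`):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Conv2D, TimeDistributed

def convlstmModel(height=10, width=10):
    model = Sequential()
    # ConvLSTM2D consumes 5D input: (batch, steps, height, width, channels)
    model.add(ConvLSTM2D(12, kernel_size=5, padding="same",
                         batch_input_shape=(1, None, height, width, 1),
                         return_sequences=True, stateful=True))
    # TimeDistributed applies the same Conv2D to every time step,
    # so the 5D output of ConvLSTM2D is accepted
    model.add(TimeDistributed(Conv2D(20, 3, padding="same",
                                     activation='softmax')))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

model = convlstmModel()
# One batch holding a 10-frame sequence of 10x10 single-channel images
x = np.random.rand(1, 10, 10, 10, 1).astype("float32")
y = model.predict(x)
print(y.shape)  # (1, 10, 10, 10, 20): the steps dimension is preserved
```

Note that the output is still 5D; as said above, matching it to your labels (e.g. flattening the spatial dimensions, or reshaping the targets to the same 5D layout) is still up to you.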