Tags: python, keras, classification, lstm, conv-neural-network

Is a pooling layer mandatory when using CNN with LSTM in Keras?


I am using CNN+LSTM for a binary classification problem. My code is as follows.

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Dense, Dropout

def create_network():
    model = Sequential()
    model.add(Conv1D(200, kernel_size=2, activation='relu', input_shape=(35, 6)))
    model.add(Conv1D(200, kernel_size=2, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(LSTM(200, return_sequences=True))
    model.add(LSTM(200, return_sequences=True))
    model.add(LSTM(200))
    model.add(Dense(100))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

When I use the above model, I get poor results. However, when I remove the model.add(MaxPooling1D(3)) layer, the results improve somewhat.

My questions are as follows.

  • Is it mandatory to have a pooling layer when a CNN is used with an LSTM (given that I am also using a dropout layer)?
  • If it is mandatory, which other kinds of pooling layers would you suggest?

I am happy to provide more details if needed.


Solution

  • Firstly, you don't have to use a MaxPooling1D layer. Here, max pooling only reduces the number of timesteps passed on to the LSTM. From a purely technical point of view, an LSTM can work with any sequence length, and Keras automatically infers the correct input shape for each layer.

    There are some interesting things going on here, though, that you might want to take a look at:

    1. It's hard to say that one pooling mechanism will always work better than another. The intuition, however, is that max pooling picks out the extreme value in each window, and so works better when the extremes carry the signal, while average pooling smooths the extremes away; the toy demo after this list makes the difference concrete.

    2. You left the strides implicit, and it should be noted that the default stride differs between pooling and convolution layers (strides=None, which falls back to the pool size, versus strides=1). This means that comparing the network with and without MaxPooling1D(3) is not exactly comparing apples to apples: with the implicit stride of 3, pooling cuts the 33 timesteps coming out of the convolutions down to 11, greatly reducing the amount of data the LSTM layers receive. The shape sketch below makes that explicit.
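
    As a quick illustration of point 1, here is a toy comparison (a minimal sketch, assuming a TensorFlow 2 backend with eager execution, where Keras layers can be called directly on arrays; the spike value is made up for illustration):

    import numpy as np
    from keras.layers import MaxPooling1D, AveragePooling1D

    # One sequence of 6 timesteps with a single extreme spike,
    # shape (batch=1, steps=6, features=1).
    x = np.array([[[0.1], [0.2], [9.0], [0.1], [0.2], [0.3]]], dtype='float32')

    # Each layer pools over windows of 3 timesteps (stride defaults to 3).
    print(MaxPooling1D(3)(x).numpy().ravel())      # approx. [9.0 0.3] -> the spike dominates
    print(AveragePooling1D(3)(x).numpy().ravel())  # approx. [3.1 0.2] -> the spike is diluted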
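
    And to make point 2 concrete, this sketch prints the sequence length the LSTM stack would receive under each pooling choice (the conv_front helper is my own, not from the question):

    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, AveragePooling1D

    # Hypothetical helper: builds the two-convolution front end from the
    # question and optionally appends a pooling layer.
    def conv_front(pool=None):
        model = Sequential()
        model.add(Conv1D(200, kernel_size=2, activation='relu', input_shape=(35, 6)))
        model.add(Conv1D(200, kernel_size=2, activation='relu'))  # output: (33, 200)
        if pool is not None:
            model.add(pool)
        return model

    print(conv_front().output_shape)                                # (None, 33, 200)
    print(conv_front(MaxPooling1D(3)).output_shape)                 # (None, 11, 200), strides defaults to pool_size
    print(conv_front(MaxPooling1D(3, strides=1)).output_shape)      # (None, 31, 200)
    print(conv_front(AveragePooling1D(3, strides=1)).output_shape)  # (None, 31, 200)

    With strides=1 the pooled network keeps 31 of the 33 timesteps, so any remaining performance gap is attributable to the pooling operation itself rather than to the LSTM simply seeing a third of the data.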