Tags: keras, lstm, recurrent-neural-network, tf.keras

Meaning of 2D input in Keras LSTM


In Keras, the LSTM input is in the shape of [batch, timesteps, feature]. What if I indicate the input as keras.Input(shape=(20, 1)) and feed a matrix of shape (100, 20, 1) as input? What batch size is it considering in this case? Is the batch size 100, with 20 time steps in each batch?


Solution

  • TL;DR

    The batch, timesteps, features in your case is defined as (None, 20, 1), where the batch dimension stands in for the batch_size parameter passed during model.fit. The model does not need to know this beforehand. Therefore, when you define your input layer (or your LSTM layer's input shape), you simply define (timesteps, features), which is (20, 1). A simple model.summary() shows that this input shape is translated to (None, 20, 1) when the computation graph is created.
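
    For instance, a quick way to confirm this (a minimal check, assuming TensorFlow 2.x / tf.keras) is to print the symbolic shape of the Input tensor directly:

    #Minimal check: the batch dimension is left as None
    from tensorflow import keras

    inp = keras.Input(shape=(20, 1))
    print(inp.shape)    #(None, 20, 1) -> batch, timesteps, features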


    Deeper dive into the subject

    A good way to understand what's going on is to simply print the summary of your model. Let me take a simple example here and walk you through the steps -

    #Creating a simple stacked LSTM model
    
    from tensorflow.keras import layers, Model
    import numpy as np
    
    inp = layers.Input((20,1))                       #<------
    x = layers.LSTM(5, return_sequences=True)(inp)
    x = layers.LSTM(4)(x)
    out = layers.Dense(1, activation='sigmoid')(x)
    
    model = Model(inp, out)
    model.compile(loss='binary_crossentropy')
    model.summary()
    
    Model: "model_8"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     input_10 (InputLayer)       [(None, 20, 1)]           0         
                                                                     
     lstm_14 (LSTM)              (None, 20, 5)             140       
                                                                     
     lstm_15 (LSTM)              (None, 4)                 160       
                                                                     
     dense_8 (Dense)             (None, 1)                 5         
                                                                     
    =================================================================
    Total params: 305
    Trainable params: 305
    Non-trainable params: 0
    _________________________________________________________________
    

    As you can see here, the summary displays the flow of tensors (more specifically, how the shapes of tensors change as they flow down the network). I am using the functional API, which lets me explicitly create an input layer of shape (20, 1), which I then pass to the LSTM. But interestingly, you can see that the actual shape of this Input layer is (None, 20, 1). This is the batch, timesteps, features that you are referring to.
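
    If you prefer to inspect these shapes programmatically rather than reading the printed summary, here is a small sketch (assuming the model object built above and tf.keras):

    #Sketch: iterate over the layers and print each symbolic output shape
    for layer in model.layers:
        print(layer.name, layer.output.shape)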

    The time steps are 20 and there is a single feature, so that's easy to understand. The None, however, is a placeholder for the batch_size parameter, which you define during model.fit.

    #Fit model
    X_train, y_train = np.random.random((100,20,1)), np.random.random((100,))
    model.fit(X_train, y_train, batch_size=10, epochs=2)
    
    Epoch 1/2
    10/10 [==============================] - 1s 4ms/step - loss: 0.6938
    Epoch 2/2
    10/10 [==============================] - 0s 3ms/step - loss: 0.6932
    

    In this example, I set batch_size to 10. This means that when you train the model, each "step" passes a batch of shape (10, 20, 1) to the model, and there are 10 such steps in each epoch, because the overall size of the training data is (100, 20, 1). This is indicated by the 10/10 that you see in front of the progress bar for each epoch.
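
    If you want to work out that number of steps yourself, a simple sketch (using numpy's ceil to account for a possible smaller final batch) looks like this:

    import numpy as np

    n_samples, batch_size = 100, 10
    steps_per_epoch = int(np.ceil(n_samples / batch_size))
    print(steps_per_epoch)    #10 -> the 10/10 shown next to the progress bar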


    Another interesting thing to note is that you don't necessarily need to define the dimensions of the input, as long as you obey the basic rules of model training and batch size constraints. Here is an example. This time I define the number of timesteps as None, which means I can now pass variable-length sequences (variable-length sentences, for example) to be encoded by the LSTM layers.

    from tensorflow.keras import layers, Model
    import numpy as np
    
    inp = layers.Input((None,1))                       #<------
    x = layers.LSTM(5, return_sequences=True)(inp)
    x = layers.LSTM(4)(x)
    out = layers.Dense(1, activation='sigmoid')(x)
    
    model = Model(inp, out)
    model.compile(loss='binary_crossentropy')
    model.summary()
    
    Model: "model_10"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     input_12 (InputLayer)       [(None, None, 1)]         0         
                                                                     
     lstm_18 (LSTM)              (None, None, 5)           140       
                                                                     
     lstm_19 (LSTM)              (None, 4)                 160       
                                                                     
     dense_10 (Dense)            (None, 1)                 5         
                                                                     
    =================================================================
    Total params: 305
    Trainable params: 305
    Non-trainable params: 0
    _________________________________________________________________
    

    This means that the model doesn't need to know beforehand how many timesteps it will have to work with, just as it doesn't need to know beforehand what batch_size it will get. These things can be inferred during model.fit or passed as a parameter. Notice that model.summary() simply propagates this unknown timesteps dimension to the subsequent layers.
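
    To see this in action, here is a quick check (assuming the variable-timestep model just defined) that passes batches with different sequence lengths through the same model:

    #Sketch: the same model accepts batches with different numbers of timesteps
    print(model.predict(np.random.random((3, 20, 1))).shape)    #(3, 1)
    print(model.predict(np.random.random((3, 35, 1))).shape)    #(3, 1)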

    An important note though - LSTMs can work with variable-size inputs because all you have to do is pass the timesteps as None, as in the example above. However, you have to ensure that each individual batch has the same number of time steps. In other words, to work with variable-sized sentences, say [(20, 1), (25, 1), (20, 1), ...], either use a batch size of 1 so that each batch trivially has a consistent length, or write a generator that creates batches of equal batch_size and combines sentences of constant length: for example, the first batch contains only five (20, 1) sentences, the second batch only five (25, 1) sentences, and so on. The second method is faster than the first but can be more painful to set up. A rough sketch of such a generator follows below.
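
    As an illustration of the second method (a hypothetical sketch, not the only way to set this up), you can bucket the sentences by their length and yield one equal-length batch at a time:

    #Sketch: bucket variable-length sentences by length and yield equal-length batches
    from collections import defaultdict
    import numpy as np

    def equal_length_batches(sentences, labels, batch_size):
        buckets = defaultdict(list)
        for x, y in zip(sentences, labels):
            buckets[len(x)].append((x, y))            #group by number of timesteps
        for length, items in buckets.items():
            for i in range(0, len(items), batch_size):
                chunk = items[i:i + batch_size]
                X = np.array([x for x, _ in chunk])   #shape: (batch, length, 1)
                y = np.array([lbl for _, lbl in chunk])
                yield X, y

    #Usage (sentences is a list of arrays shaped (timesteps, 1)):
    #for X_batch, y_batch in equal_length_batches(sentences, labels, batch_size=5):
    #    model.train_on_batch(X_batch, y_batch)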


    Bonus

    Also, for anyone curious about the effect of batch_size on model training: a large batch_size can significantly speed up computation, and increasing it is often preferred over decaying the learning rate, but it can also cause what is known as a Generalization Gap. This topic is well explored in this awesome paper.

    These 2 papers should give you a lot of clarity on how to use batch_size as a powerful parameter in your model training, one that is quite often ignored.