In Keras, LSTM input is in the shape of [batch, timesteps, feature]. What if I indicate the input as keras.Input(shape=(20, 1)) and feed a matrix of shape (100, 20, 1) as input? What number of batches is it considering in this case? Is the batch size 100 with 20 time steps in each batch?
The batch, timesteps, features in your case are defined as None, 20, 1, where the batch represents the batch_size parameter passed during model.fit. The model does not need to know this beforehand. Therefore, when you define your input layer (or your LSTM layer's input shape), you simply define (timesteps, features), which is (20, 1). A simple model.summary() would show you that this input size is translated to (None, 20, 1) while creating the computation graph.
A good way to understand what's going on is to simply print the summary of your model. Let me take a simple example here and walk you through the steps -
#Creating a simple stacked LSTM model
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((20,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_10 (InputLayer) [(None, 20, 1)] 0
lstm_14 (LSTM) (None, 20, 5) 140
lstm_15 (LSTM) (None, 4) 160
dense_8 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
As you see here, the flow of tensors (more specifically, how the shapes of tensors change as they flow down the network) is displayed. As you can see, I am using the functional API, which allows me to explicitly create an input layer of shape (20, 1), which I then pass to the LSTM. But interestingly, you can see that the actual shape of this Input layer is (None, 20, 1). This is the batch, timesteps, features that you are also referring to. The time steps are 20 and there is a single feature, so that's easy to understand. The None, however, is a placeholder for the batch_size parameter, which you define during model.fit.
#Fit model
X_train, y_train = np.random.random((100,20,1)), np.random.random((100,))
model.fit(X_train, y_train, batch_size=10, epochs=2)
Epoch 1/2
10/10 [==============================] - 1s 4ms/step - loss: 0.6938
Epoch 2/2
10/10 [==============================] - 0s 3ms/step - loss: 0.6932
In this example, I set the batch_size to 10. This means that when you train the model, each "step" will pass a batch of shape (10, 20, 1) to the model, and there will be 10 such steps in each epoch, because the overall size of the training data is (100, 20, 1). This is indicated by the 10/10 that you see in front of the progress bar for each epoch.
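To tie this back to the original question: if you feed the (100, 20, 1) array without passing batch_size at all, Keras falls back to its default batch size of 32, so the 100 samples are split into ceil(100/32) = 4 steps per epoch (the last batch being smaller). A minimal sketch, reusing the model and data from above:
#No batch_size passed -> Keras defaults to 32, so 100 samples give 4 steps per epoch
model.fit(X_train, y_train, epochs=1)
#The progress bar would show 4/4 instead of 10/10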
Another interesting thing to note is that you don't necessarily need to define the dimensions of the input, as long as you obey the basic rules of model training and batch size constraints. Here is an example. Here I define the number of timesteps as None, which means that I can now pass variable-length timesteps (variable-length sentences, for example) to encode using the LSTM layers.
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((None,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_12 (InputLayer) [(None, None, 1)] 0
lstm_18 (LSTM) (None, None, 5) 140
lstm_19 (LSTM) (None, 4) 160
dense_10 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
This means that the model doesn't need to know how many timesteps it will have to work with beforehand, similar to the fact that it doesn't need to know what batch_size it would get beforehand. These things can be inferred during model.fit or passed as a parameter. Notice that model.summary() simply extends this lack of information around the timesteps dimension to the subsequent layers.
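To make this concrete, here is a small sketch that calls the variable-timestep model defined just above on two batches with different sequence lengths; the number of timesteps only has to be consistent within a single batch:
#Same model, two batches with different numbers of timesteps
short_batch = np.random.random((3, 15, 1)) #3 sequences of 15 timesteps each
long_batch = np.random.random((3, 40, 1))  #3 sequences of 40 timesteps each
print(model(short_batch).shape) #(3, 1)
print(model(long_batch).shape)  #(3, 1)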
An important note though - LSTMs can work with variable-size inputs because all you have to do is pass the timesteps as None, as in the example above. However, you have to ensure that each batch independently has the same number of time steps. In other words, to work with variable-sized sentences, say [(20, 1), (25, 1), (20, 1), ...], either use a batch size of 1 so that each batch trivially has a consistent size, or create a generator which creates batches of equal batch_size, combining only sentences of the same length. For example, the first batch contains only 5 sentences of shape (20, 1), the second batch contains only 5 sentences of shape (25, 1), etc. The second method is faster than the first, but may be more painful to set up.
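Here is a rough sketch of that second approach; the function name and data layout are just placeholders for illustration, assuming sentences is a list of (timesteps, 1) arrays of varying length and labels is a matching list of targets:
from collections import defaultdict
import numpy as np

def equal_length_batches(sentences, labels, batch_size):
    #Group sample indices by their number of timesteps
    buckets = defaultdict(list)
    for i, s in enumerate(sentences):
        buckets[len(s)].append(i)
    #Yield fixed-size batches in which every sentence has the same length
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            X = np.stack([sentences[i] for i in batch]) #(batch, timesteps, 1)
            y = np.array([labels[i] for i in batch])
            yield X, y

#Usage sketch with the variable-timestep model from above:
#for X_batch, y_batch in equal_length_batches(sentences, labels, batch_size=5):
#    model.train_on_batch(X_batch, y_batch)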
Also, for anyone curious about the effect of batch_size on model training: a large batch_size can be very helpful for speeding up computation, and is sometimes preferred over decaying the learning rate, but it can cause what is known as a Generalization Gap. This topic is well explored in this awesome paper. These 2 papers should give a lot of clarity around how to use batch_size as a powerful parameter for your model training, one which is quite often ignored.