This training curve is from a Transformer model that processes a 2D (excluding batch) sequential signal, trained with the Adam optimizer, a batch size of 32, and a custom LR scheduler that replicates the warmup schedule used in the 'Attention Is All You Need' paper. The curve below plateaus, with the training loss eventually slightly lower than the validation loss, but the training loss never starts to climb back up, which I interpreted as the model never starting to overfit and simply no longer re-adjusting its weights after around epoch 90.
Is there a better interpretation, and what could I do to improve this model?
Below is my brief reproducible code:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

x_train = np.random.normal(size=(32, 512, 512))
batch_size = 32
_, H, W = x_train.shape  # x_train is (batch, H, W), so unpack all three dims
rows, cols = np.indices((H, W), sparse=True)
padding_mask_init = np.zeros((H, W, W), dtype=np.bool_)
padding_mask_init[rows, 1:, cols] = 1
padding_mask = padding_mask_init[:batch_size]
embed_dim = 512
dense_dim = 2048
num_heads = 2
shape = (batch_size, embed_dim, 512) #(32, 512, 512)
# float16 inputs can clash with the default float32 layer outputs in the residual
# add below; consider float32 or a tf.keras.mixed_precision policy if this errors
decoder_inputs = layers.Input(batch_input_shape=shape, dtype=tf.float16)
mha_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
mha_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
layernorm_1 = layers.LayerNormalization()
Z = decoder_inputs
Z = mha_1(query=Z, value=Z, key=Z, use_causal_mask=True, attention_mask=padding_mask)
Z = layernorm_1(Z + decoder_inputs)
Z = mha_2(query=Z, value=decoder_inputs, key=decoder_inputs, attention_mask=padding_mask)
outputs = layers.TimeDistributed(keras.layers.Dense(embed_dim, activation="softmax"))(Z)
model = keras.Model(decoder_inputs, outputs)
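# `lr_schedule` is not defined in this snippet; what follows is only a sketch,
# assuming it implements the warmup schedule from "Attention Is All You Need":
# lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
class lr_schedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # linear warmup for `warmup_steps`, then decay proportional to 1/sqrt(step)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)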
model.compile(
    loss="mean_squared_error",
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=lr_schedule(embed_dim, 3000),
        beta_1=0.9, beta_2=0.98, epsilon=1.0e-9),
    metrics=["accuracy"],
)
# `dataset` / `val_dataset` are batched tf.data pipelines built from the training
# and validation signals (construction not shown here)
history = model.fit(dataset, epochs=200, validation_data=val_dataset)
First, I will leave you with a blog post that explains nicely how to approach the training process of a deep neural network. I believe it will be helpful reading in any case.
Second, a training loss that does not climb cannot be interpreted as the absence of overfitting. Rather, it is a missing climb in the validation loss that could indicate your model does not overfit.
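As a quick check, here is a minimal sketch (assuming the `history` object returned by `model.fit` above) that plots both curves, since the gap between them is what matters:

import matplotlib.pyplot as plt

# training loss vs. validation loss per epoch; a validation loss that turns
# upward while the training loss keeps falling is the usual sign of overfitting
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()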
In the end, optimization is all about capacity. If your model is not able to overfit, you could try increasing its number of parameters, or first reduce the data to see whether you are able to overfit a small subset of your overall dataset (a sketch of that sanity check is shown below).
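A minimal sketch of that sanity check, assuming `dataset` is a batched tf.data.Dataset and the model has been freshly built and compiled:

# train on only a few batches; a model with sufficient capacity should be able
# to drive the training loss close to zero on such a tiny subset
small_dataset = dataset.take(4)  # roughly 4 batches of 32 samples
overfit_history = model.fit(small_dataset, epochs=200)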
Things to consider next:
Note: Without knowing your data, this is really difficult to answer.