Tags: tensorflow, keras, deep-learning, lstm

Why does masking the input produce the same loss as the unmasked input in Keras?


I am experimenting with LSTMs on variable-length input for this reason. I wanted to be sure that the loss is calculated correctly under masking, so I trained the model below, which uses a Masking layer on padded sequences.

from tensorflow.keras.layers import LSTM, Masking, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models, losses
import tensorflow as tf
import numpy as np
import os

"""
For generating reproducible results, set seed. 
"""
def set_seed(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    
"""
Set some right most indices to mask value like padding
"""
def create_padded_seq(num_samples, timesteps, num_feats, mask_value):
    feats = np.random.random([num_samples, timesteps, num_feats]).astype(np.float32)            # Generate samples
    for i in range(0, num_samples):
        rand_index = np.random.randint(low=2, high=timesteps, size=1)[0]                        # Apply padding
        feats[i, rand_index:, 0] = mask_value
    return feats

set_seed(42)
num_samples = 100
timesteps = 6
num_feats = 1
num_classes = 3
num_lstm_cells = 1
mask_value = -100
num_epochs = 5

X_train = create_padded_seq(num_samples, timesteps, num_feats, mask_value)
y_train = np.random.randint(num_classes, size=num_samples)
cat_y_train = to_categorical(y_train, num_classes)

masked_model = models.Sequential(name='masked')
masked_model.add(Masking(mask_value=mask_value, input_shape=(timesteps, num_feats)))
masked_model.add(LSTM(num_lstm_cells, return_sequences=False))
masked_model.add(Dense(num_classes, activation='relu'))
masked_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=["accuracy"])
print(masked_model.summary())
masked_model.fit(X_train, cat_y_train, batch_size=1, epochs=num_epochs, verbose=True)

This is the verbose output:

Model: "masked"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking (Masking)            (None, 6, 1)              0         
_________________________________________________________________
lstm (LSTM)                  (None, 1)                 12        
_________________________________________________________________
dense (Dense)                (None, 3)                 6         
=================================================================
Total params: 18
Trainable params: 18
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 2/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 3/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 4/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 5/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
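
As a sanity check on the masking itself, the boolean mask that the Masking layer produces can be inspected directly through its compute_mask method. A minimal sketch (assuming TF 2.x eager mode and the variables defined above):

masking_layer = Masking(mask_value=mask_value)
mask = masking_layer.compute_mask(tf.convert_to_tensor(X_train))   # shape (num_samples, timesteps)
print(mask.shape)        # (100, 6)
print(mask[0].numpy())   # e.g. [ True  True False False False False]
# Note: a timestep is masked only when all of its features equal mask_value;
# here num_feats == 1, so padding feature 0 pads the whole timestep.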

I also removed the Masking layer and trained another model on the same data to see the effect of masking. This is that model:

unmasked_model = models.Sequential(name='unmasked')
unmasked_model.add(LSTM(num_lstm_cells, return_sequences=False, input_shape=(timesteps, num_feats)))
unmasked_model.add(Dense(num_classes, activation='relu'))
unmasked_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=["accuracy"])
print(unmasked_model.summary())
unmasked_model.fit(X_train, cat_y_train, batch_size=1, epochs=num_epochs, verbose=True)

And this is the verbose output:

Model: "unmasked"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 1)                 12        
_________________________________________________________________
dense (Dense)                (None, 3)                 6         
=================================================================
Total params: 18
Trainable params: 18
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 2/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 3/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 4/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 5/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400

The losses are the same in both outputs. What is the reason for that? It seems like the Masking layer has no effect on the loss; is that correct? If not, how can I observe the effect of the Masking layer?


Solution

  • In the case of a multi-class classification task, the problem seems to be the last activation function...

    If you change relu to softmax, your network can produce probabilities in the range [0, 1], which is what categorical cross-entropy expects (see the sketch below).
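
    For example, a minimal sketch of the corrected model (the name fixed_model is illustrative; the other variables come from the question, and only the final activation changes from relu to softmax):

    fixed_model = models.Sequential(name='masked_softmax')
    fixed_model.add(Masking(mask_value=mask_value, input_shape=(timesteps, num_feats)))
    fixed_model.add(LSTM(num_lstm_cells, return_sequences=False))
    fixed_model.add(Dense(num_classes, activation='softmax'))   # softmax instead of relu
    fixed_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=['accuracy'])
    fixed_model.fit(X_train, cat_y_train, batch_size=1, epochs=num_epochs, verbose=True)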