I have a simple Keras sequential model. I have N categories and I have to predict in which category the next point will fall based on the previous ones.
The weird thing is that when I remove the Softmax activation from the output layer, performance is better (lower loss and higher sparse_categorical_accuracy). As loss I'm using sparse_categorical_crossentropy with from_logits=True.
Is there any reason for that? Shouldn't it be the opposite?
Thank you in advance for any suggestion!
import tensorflow as tf

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])
    return model

model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss, metrics=['sparse_categorical_accuracy'])

EPOCHS = 5
history = model.fit(train_set, epochs=EPOCHS, validation_data=val_set)
In a nutshell, when you use the option from_logits=True, you are telling the loss function that your network's output is not normalized, i.e. that it consists of raw logits. Since you are using a softmax activation in your last layer, your outputs are in fact already normalized, so you have two consistent options:

1. keep the softmax activation and pass from_logits=False to the loss, or
2. remove the softmax from the last Dense layer and keep from_logits=True.

The second option is generally preferred, because computing the cross-entropy directly from the logits is more numerically stable. Your current setup mixes the two: the loss applies a softmax internally on top of the one already in the layer, so the output distribution gets squashed toward uniform, the gradients shrink, and training suffers. That is why removing the softmax improves both your loss and your accuracy.
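You can see the double-softmax effect on a toy example. The snippet below is a minimal sketch (the logit values are made up for illustration, and it assumes TF 2.x eager execution): it compares the loss computed from raw logits against the loss computed when an already-softmaxed output is passed to a loss that expects logits, which is what your model is doing now.

import tensorflow as tf

# Raw, unnormalized scores for 3 classes (made-up values for illustration).
logits = tf.constant([[2.0, 1.0, 0.1]])

# Softmax once: the intended probability distribution.
probs = tf.nn.softmax(logits)        # ~[0.659, 0.242, 0.099]

# Softmax twice: what happens when a softmax output is fed into a loss
# that expects logits -- the distribution is squashed toward uniform.
double = tf.nn.softmax(probs)        # ~[0.449, 0.296, 0.256]

labels = tf.constant([0])

# Correct: raw logits with from_logits=True.
good = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# What your model effectively does: probabilities treated as logits.
bad = tf.keras.losses.sparse_categorical_crossentropy(labels, probs, from_logits=True)

print(good.numpy())  # ~[0.417]
print(bad.numpy())   # ~[0.802] -- larger loss and flatter gradients

If you keep from_logits=True, just drop activation='softmax' from the Dense layer; when you need actual probabilities at inference time, apply tf.nn.softmax to the model output yourself.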