I am training a neural network to distinguish between three classes. Naturally, I went for PyTorch's CrossEntropyLoss. During experimentation, I realized that the loss was significantly higher when a Softmax layer was put at the end of the model. So I decided to experiment further:
import torch
from torch import nn

pred_1 = torch.Tensor([[0.1, 0.2, 0.7]])  # already sums to 1, like probabilities
pred_2 = torch.Tensor([[1, 2, 7]])        # pred_1 scaled by 10
pred_3 = torch.Tensor([[2, 4, 14]])       # pred_1 scaled by 20
true = torch.Tensor([2]).long()           # target class index

loss = nn.CrossEntropyLoss()
print(loss(pred_1, true))
print(loss(pred_2, true))
print(loss(pred_3, true))
The result of this code is as follows:
0.7679
0.0092
5.1497e-05
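As a sanity check (my own addition, not part of the original experiment), these numbers are exactly -log of the softmax probability assigned to class 2, which can be reproduced by hand:

import torch.nn.functional as F

for pred in (pred_1, pred_2, pred_3):
    p = F.softmax(pred, dim=1)[0, 2]  # probability of the true class
    print(p.item(), -torch.log(p).item())

The probability assigned to class 2 rises from roughly 0.46 (pred_1) to 0.99 (pred_2) to nearly 1 (pred_3), which is exactly where the shrinking loss comes from.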
In other words, I also tried what happens when the input is multiplied by some constant: pred_2 and pred_3 are just pred_1 scaled by 10 and 20, and the bigger the scale, the lower the loss.
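To make the trend explicit, here is a small loop over a few constants (an extra check of mine, not in the original experiment):

for c in (1, 10, 20, 100):
    print(c, loss(c * pred_1, true).item())

The loss shrinks monotonically as the constant grows; at c = 100 it has already underflowed to essentially zero in float32.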
Several sources (1, 2) stated that the loss has a softmax built in, but if that were the case, I would have expected all of the examples above to return the same loss, which clearly isn't the case.
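The claim itself is easy to verify (my own check below): nn.CrossEntropyLoss produces exactly the same numbers as an explicit log-softmax followed by negative log-likelihood, so a softmax is indeed built in, and yet the values still depend on the scale of the inputs:

import torch.nn.functional as F

for pred in (pred_1, pred_2, pred_3):
    # cross_entropy == log_softmax + nll_loss, value for value
    print(F.cross_entropy(pred, true),
          F.nll_loss(F.log_softmax(pred, dim=1), true))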
This poses the following question: if bigger outputs lead to a lower loss, wouldn't the network optimize towards outputting bigger values, thereby skewing the loss curves? If so, it seems like a Softmax layer would fix that. But since this results in a higher loss overall, how useful would the resulting loss actually be?
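For reference, here is what happens when an explicit Softmax layer sits in front of the loss, which is roughly the setup my model had (a reconstruction, not my actual training code):

softmax = nn.Softmax(dim=1)

for pred in (pred_1, pred_2, pred_3):
    # the scores are squashed into [0, 1] and then
    # softmaxed a second time inside CrossEntropyLoss
    print(loss(softmax(pred), true))

Since the inputs to the loss are now confined to [0, 1], the loss is bounded below by about 0.55 (the value for a perfectly confident prediction such as [0, 0, 1]) and can never reach zero.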
From the docs, the input to CrossEntropyLoss "is expected to contain raw, unnormalized scores for each class". Those are typically called logits.
There are two questions: