I am training a neural network to distinguish between three classes. Naturally, I went for PyTorch's CrossEntropyLoss. During experimentation, I realized that the loss was significantly higher when a Softmax layer was put at the end of the model. So I decided to experiment further:
import torch
from torch import nn

pred_1 = torch.Tensor([[0.1, 0.2, 0.7]])  # already sums to 1, like probabilities
pred_2 = torch.Tensor([[1, 2, 7]])        # pred_1 scaled by 10
pred_3 = torch.Tensor([[2, 4, 14]])       # pred_1 scaled by 20
true = torch.Tensor([2]).long()           # target class index

loss = nn.CrossEntropyLoss()
print(loss(pred_1, true))
print(loss(pred_2, true))
print(loss(pred_3, true))
The result of this code is as follows:
0.7679
0.0092
5.1497e-05
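As a sanity check (my own addition, not part of the original experiment), these numbers are exactly -log of the softmax probability assigned to class 2, which can be reproduced by hand:

import torch.nn.functional as F

for pred in (pred_1, pred_2, pred_3):
    p = F.softmax(pred, dim=1)[0, 2]  # probability of the true class
    print(p.item(), -torch.log(p).item())

The probability assigned to class 2 rises from roughly 0.46 (pred_1) to 0.99 (pred_2) to nearly 1 (pred_3), which is exactly where the shrinking loss comes from.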
In other words, I also tried what happens when the input is multiplied by some constant: pred_2 and pred_3 are just pred_1 scaled by 10 and 20, and the bigger the scale, the lower the loss.
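To make the trend explicit, here is a small loop over a few constants (an extra check of mine, not in the original experiment):

for c in (1, 10, 20, 100):
    print(c, loss(c * pred_1, true).item())

The loss shrinks monotonically as the constant grows; at c = 100 it has already underflowed to essentially zero in float32.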
Several sources (1, 2) stated that the loss has a softmax built in, but if that were the case, I would have expected all of the examples above to return the same loss, which clearly isn't the case.
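The claim itself is easy to verify (my own check below): nn.CrossEntropyLoss produces exactly the same numbers as an explicit log-softmax followed by negative log-likelihood, so a softmax is indeed built in, and yet the values still depend on the scale of the inputs:

import torch.nn.functional as F

for pred in (pred_1, pred_2, pred_3):
    # cross_entropy == log_softmax + nll_loss, value for value
    print(F.cross_entropy(pred, true),
          F.nll_loss(F.log_softmax(pred, dim=1), true))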
This poses the following question: if bigger outputs lead to a lower loss, wouldn't the network optimize towards outputting bigger values, thereby skewing the loss curves? If so, it seems like a Softmax layer would fix that. But since this results in a higher loss overall, how useful would the resulting loss actually be?
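For reference, here is what happens when an explicit Softmax layer sits in front of the loss, which is roughly the setup my model had (a reconstruction, not my actual training code):

softmax = nn.Softmax(dim=1)

for pred in (pred_1, pred_2, pred_3):
    # the scores are squashed into [0, 1] and then
    # softmaxed a second time inside CrossEntropyLoss
    print(loss(softmax(pred), true))

Since the inputs to the loss are now confined to [0, 1], the loss is bounded below by about 0.55 (the value for a perfectly confident prediction such as [0, 0, 1]) and can never reach zero.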
From the docs, the input to CrossEntropyLoss "is expected to contain raw, unnormalized scores for each class". Those are typically called logits.
There are two questions: