Tags: python, deep-learning, neural-network, pytorch, loss-function

How to convert probability to angle degree in a head-pose estimation problem?


I reused someone else's code to predict head pose as Euler angles. The author trained a classification network that returns bin-classification results for the three angles (yaw, pitch, roll), with 66 bins per angle. The code then somehow converts the probabilities to the corresponding angle, as written in lines 150 to 152 here. Could someone explain the formula?

These are the relevant lines of code in the above file:

[56]  model = hopenet.Hopenet(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3], 66) # a variant of ResNet50
[80]  idx_tensor = [idx for idx in xrange(66)]
[81]  idx_tensor = torch.FloatTensor(idx_tensor).cuda(gpu)
[144] yaw, pitch, roll = model(img)
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99

Solution

  • If we look at the training code, and the authors' paper,* we see that the loss function is a sum of two losses:

    1. the raw model output (a vector of per-bin logits, used for the classification loss):
    [144] yaw, pitch, roll = model(img)
    
    2. a linear combination of the bin predictions (the predicted continuous angle):
    [146] yaw_predicted = F.softmax(yaw)
    [150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99
    

    Since `3 * sum(softmax(output) * idx_tensor) - 99` is the transformation applied during training for the regression loss (but is not explicitly part of the model's `forward`), the same transformation must be applied to the raw output at inference time to convert the vector of bin probabilities into a single angle prediction. The formula itself is an expectation: `sum(softmax(output) * idx_tensor)` is the expected bin index under the predicted distribution, a value in [0, 65]. Each bin covers 3 degrees and the 66 bins span [-99°, +99°], so multiplying by 3 and subtracting 99 maps the expected bin index to degrees.
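    As a sanity check, here is a minimal standalone sketch of that conversion for a single 66-bin output (the logits here are made up for illustration; this is not the authors' code):

    ```python
    import torch
    import torch.nn.functional as F

    # Hypothetical raw logits for one image's yaw head (66 bins).
    # Bin i covers the 3-degree interval starting at -99 + 3*i degrees.
    logits = torch.zeros(66)
    logits[40] = 10.0  # pretend the network is very confident about bin 40

    probs = F.softmax(logits, dim=0)  # bin probabilities, summing to 1
    idx_tensor = torch.arange(66, dtype=torch.float32)

    # Expected bin index under the predicted distribution, then map it
    # to degrees: each bin is 3 degrees wide, bin 0 starts at -99 degrees.
    expected_bin = torch.sum(probs * idx_tensor)
    angle_deg = expected_bin * 3 - 99
    print(angle_deg.item())  # close to 40 * 3 - 99 = 21 degrees
    ```

    With a sharply peaked distribution the expectation reduces to the winning bin's center; with a spread-out distribution it interpolates smoothly between bins, which is exactly what makes the prediction finer-grained than plain argmax classification.
    
    
    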


    *

    3.2. The Multi-Loss Approach

    All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.

    We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.

    The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy, thus the network learns to predict the neighbourhood of the pose in a robust fashion. By having three cross-entropy losses, one for each Euler angle, we have three signals which are backpropagated into the network which improves learning. In order to obtain fine-grained predictions we compute the expectation of each output angle for the binned output. The detailed architecture is shown in Figure 2.

    We then add a regression loss to the network, namely a mean-squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and the regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:

    L = H(y, ŷ) + α · MSE(y, ŷ)

    where H and MSE respectively designate the cross-entropy and mean squared error loss functions.
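
    Putting the two components together, a per-angle loss along these lines could be sketched as follows. The `alpha` value, the function name, and the binning helper are illustrative assumptions, not the paper's exact code or hyperparameters:

    ```python
    import torch
    import torch.nn.functional as F

    def per_angle_loss(logits, angle_deg, alpha=0.5):
        """Sketch of one angle's multi-loss: cross-entropy on the bin
        classification plus an alpha-weighted MSE on the expected angle.
        logits: (batch, 66) raw scores; angle_deg: (batch,) ground truth
        in degrees, assumed to lie in [-99, 99)."""
        # Ground-truth bin: which 3-degree interval the angle falls into.
        target_bin = ((angle_deg + 99.0) / 3.0).floor().long().clamp(0, 65)
        cls_loss = F.cross_entropy(logits, target_bin)

        # Expected angle from the softmaxed bins (the same formula used
        # at inference: 3 * sum(probs * idx) - 99).
        idx_tensor = torch.arange(66, dtype=torch.float32)
        probs = F.softmax(logits, dim=1)
        pred_deg = torch.sum(probs * idx_tensor, dim=1) * 3 - 99
        reg_loss = F.mse_loss(pred_deg, angle_deg)

        return cls_loss + alpha * reg_loss

    # Usage with dummy data: a batch of 4 predictions and ground truths.
    logits = torch.randn(4, 66)
    angles = torch.tensor([-30.0, 0.0, 45.0, 80.0])
    loss = per_angle_loss(logits, angles)
    ```

    In the full model this loss would be computed three times, once each for yaw, pitch, and roll, and the three scalars summed before backpropagation.
    
    
    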