Tags: python, pytorch, transformer-model

How to interpret the P numbers that fairseq generate produces?


Using fairseq-generate.py, with the transformer architecture, each translation produces a section like this:

Why is it rare to discover new marine mammal species?
S-0     Why is it rare to discover new marine mam@@ mal species ?
H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015

With this explanation:

H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker

I'm wondering if it is reasonable to say that a low (absolute) number in the P row means higher confidence in that particular word? E.g. does -0.07 for "Pourquoi" mean it was happier about that than it was (-0.1849) about "est-il"? And does the low -0.0015 at the end mean it was really confident the sentence should end there?

Background: What I'm trying to work out is whether I can use either the H number, or somehow the individual P numbers, to get a confidence measure for the translation. I've been analyzing a handful of translations against the H number and didn't notice much correspondence between it and my subjective opinion of translation quality. But I have a couple where I thought the translation was particularly poor - it had missed a bit of key information - and the final P number was relatively large in magnitude, -0.6099 and -0.3091 (the final P number is around -0.11 on most of them).
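For reference, here is a minimal sketch of how the H and P numbers could be pulled out of the generate output for this kind of analysis, assuming the plain-text line format shown above (the file name is just a placeholder):

# Sketch: collect H (average score) and the per-token P scores for each sentence
# from a fairseq-generate output file ("generate.out" is a placeholder name).
scores = {}
with open('generate.out') as f:
    for line in f:
        if line.startswith('H-'):
            tag, h, _hypothesis = line.rstrip('\n').split('\t')
            scores.setdefault(int(tag[2:]), {})['H'] = float(h)
        elif line.startswith('P-'):
            tag, p = line.rstrip('\n').split('\t')
            scores.setdefault(int(tag[2:]), {})['P'] = [float(x) for x in p.split()]

# e.g. look at the worst-scored token of each hypothesis alongside its H value
for sent_id, s in sorted(scores.items()):
    print(sent_id, s['H'], min(s['P']))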


Solution

  • Q: I'm wondering if it is reasonable to say a low (absolute) number in the P row means higher confidence in that particular word?

    • Yes. As the docs say, "P is the positional score per token position". The score is actually the log probability, so the higher it is (i.e., the smaller the absolute number), the more "confident" the model is. The source code may not be that easy to follow, but the scores are generated by the SequenceScorer, and there you can see that the scores are normalized log probabilities (whether you're using a single model or an ensemble). Moreover, when printing, the scores are converted from base e to base 2:

      print('P-{}\t{}'.format(
          sample_id,
          ' '.join(map(
              lambda x: '{:.4f}'.format(x),
              # convert from base e to base 2
              hypo['positional_scores'].div_(math.log(2)).tolist(),
          ))
      ))
      
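      Since the printed values are base-2 log probabilities, you can turn each P entry back into the probability the model assigned to that token with 2**p. A small sketch using the numbers from your example:

      # The printed P values are base-2 log probabilities, so 2**p recovers
      # the probability the model assigned to each emitted token.
      p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
                  -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]
      print(['{:.3f}'.format(2 ** p) for p in p_scores])
      # e.g. the first token ("Pourquoi") comes out around 0.95 and the
      # end-of-sentence marker around 0.999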

  • Q: What I'm trying to work out is whether I can use either the H number, or somehow the individual P numbers, to get a confidence measure for the translation.

    • It turns out that the H value is simply the average of the P values, as you can see here:

      score_i = avg_probs_i.sum() / tgt_len
      

      which is also converted to base 2 when printed. You can check that with your example:

      import numpy as np
      print(np.mean([-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301, -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]))
      # >>> -0.06433076923076922
      

      Another measure that is often used to assess the performance of a language model is perplexity. Conveniently, perplexity can be computed directly from the P values, as shown in the language model example in the fairseq repository:

      # Compute perplexity for a sequence
      en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
      # tensor(15.1474)
      
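      Note that positional_scores in that snippet are natural-log scores (hence the .exp()), while the P values printed by fairseq-generate are already in base 2, so applying the same idea to your example would look roughly like this (a sketch using the numbers from your output):

      import numpy as np

      # The printed P values are base-2 log probabilities, so perplexity is
      # 2 raised to the negative mean positional score.
      p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
                  -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]
      print(2 ** -np.mean(p_scores))
      # >>> ~1.046 (close to 1, i.e. the model was very confident on average)

      Since this is just 2 ** -H, it ranks sentences exactly the same way as the H value does.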

      I'm not an expert on NLP, so I can't really tell you which one you should use in your case.