Tags: machine-learning, nlp, machine-translation, bleu

Should the BLEU score for subword NMT be calculated on the subwords or should they be joined first?


This wasn't too clear in the papers I've read. When a model is trained on a bilingual corpus that was split into subwords, e.g., via Byte-Pair Encoding (BPE), is it standard to compute the BLEU score on the subword outputs, or on the full words after rejoining the subwords?


Solution

  • BLEU is always computed on complete words (i.e., after rejoining the subwords); otherwise, scores would not be comparable across models that use different word segmentations. Even small differences in tokenization can make a big difference in the final score. This is well explained in the paper that introduced SacreBLEU, which is now the standard tool for reporting BLEU scores in academic papers.

    When BLEU is computed on BPE subwords instead of full words, the score becomes artificially inflated. Even when overall translation quality is quite low, models usually have no trouble getting individual words right. A correctly translated word would normally count only toward the unigram precision, but once it is split into multiple subwords, it also contributes matching bigrams, trigrams, and perhaps even 4-grams.
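    The inflation effect can be sketched with a toy example. Below is a minimal modified n-gram precision (just one component of BLEU, with no brevity penalty or geometric mean), computed once on full words and once on the same sentences under a hypothetical BPE-style segmentation using the `@@` continuation marker from subword-nmt. The sentences and the segmentation are invented for illustration:

    ```python
    from collections import Counter

    def ngram_precision(hyp, ref, n):
        """Modified n-gram precision: clipped n-gram matches / hyp n-grams.
        (Only one component of BLEU; no brevity penalty.)"""
        def ngrams(toks):
            return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        h, r = ngrams(hyp), ngrams(ref)
        matches = sum(min(count, r[g]) for g, count in h.items())
        return matches / sum(h.values()) if h else 0.0

    def join_bpe(tokens):
        """Rejoin subwords that use the subword-nmt '@@ ' continuation marker."""
        return " ".join(tokens).replace("@@ ", "").split()

    # Invented example: only "trans@@ lation" and "wrong" are translated correctly.
    ref_sub = ["the", "trans@@", "lation", "is", "wrong"]
    hyp_sub = ["a", "trans@@", "lation", "was", "wrong"]

    print(ngram_precision(join_bpe(hyp_sub), join_bpe(ref_sub), 2))  # 0.0 on words
    print(ngram_precision(hyp_sub, ref_sub, 2))                      # 0.25 on subwords
    ```

    On full words, no bigram matches, so the bigram precision is zero; on subwords, the single correct word "translation" alone produces a matching bigram and lifts the precision to 0.25.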