
Why does sacrebleu return a zero BLEU score for short sentences?


Why does sacrebleu need sentences to end with a dot? If I remove the dots, the score is zero.

import sacrebleu

# One hypothesis and two reference streams (one reference sentence each).
sys = ["This is cat."]
refs = [["This is a cat."],
        ["This is a bad cat."]]

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score, 2))

This returns the following:

b3 35.1862973998119
b3 35.19

When I remove the ending dots:

sys = ["This is cat"]
refs = [["This is a cat"],
        ["This is a bad cat"]]

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score, 2))

Using sacrebleu, this again prints zero, which seems odd:

b3 0.0
b3 0.0

Solution

  • BLEU is defined as the geometric mean of (modified) n-gram precisions for unigrams up to 4-grams, times a brevity penalty. Thus if there is no matching 4-gram (no 4-tuple of words) in the whole test set, BLEU is 0 by definition. The final dot is tokenized into a separate token, so the hypothesis "This is cat ." has four tokens and therefore contains a 4-gram; that 4-gram does not match the references, but sacrebleu's default smoothing still assigns it a small non-zero precision, so the overall score stays above zero (the first sketch after this list inspects the per-order precisions). Without the dot, the three-token hypothesis contains no 4-gram at all, the 4-gram precision is zero, and the geometric mean collapses to zero.

    BLEU was designed for scoring test sets with hundreds of sentences, where such a case is very unlikely. For scoring single sentences, you can use a sentence-level version of BLEU, which applies some kind of smoothing, but the results are still not ideal. You can also use a character-based metric, e.g. chrF (sacrebleu -m chrf); the second sketch after this list tries both.

    You can also pass use_effective_order=True to corpus_bleu so that only the n-gram orders that can actually occur in the hypothesis are counted, instead of all four. However, in that case the metric is not exactly what people usually refer to as BLEU; the last sketch after this list shows this option.
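
    To see what is going on, you can inspect the per-order precisions that sacrebleu reports alongside the score. This is only a small sketch: it assumes the installed sacrebleu version exposes a precisions attribute on the returned BLEUScore object (recent releases do).

    import sacrebleu

    cases = {
        "with dot": (["This is cat."], [["This is a cat."], ["This is a bad cat."]]),
        "without dot": (["This is cat"], [["This is a cat"], ["This is a bad cat"]]),
    }

    for label, (hyps, refs) in cases.items():
        b = sacrebleu.corpus_bleu(hyps, refs)
        # precisions holds the modified 1-gram to 4-gram precisions.
        # With the dot, the 4-token hypothesis contains one (unmatched but
        # smoothed) 4-gram; without it there is no 4-gram at all, so the
        # last precision is zero and the geometric mean collapses to zero.
        print(label, round(b.score, 2), [round(p, 2) for p in b.precisions])

    The "with dot" case reproduces the 35.19 from the question; in the "without dot" case the fourth precision, and therefore the score, is 0.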
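
    For the sentence-level and character-based alternatives mentioned above, a sketch along these lines should work. It assumes sacrebleu.sentence_bleu and sacrebleu.sentence_chrf are available (they are in recent releases); the exact smoothing defaults differ between versions, so treat the numbers as illustrative.

    import sacrebleu

    hyp = "This is cat"
    refs = ["This is a cat", "This is a bad cat"]

    # Sentence-level BLEU applies smoothing (and, by default, effective
    # n-gram order), so a short hypothesis with no 4-gram is not forced to 0.
    sb = sacrebleu.sentence_bleu(hyp, refs)
    print("sentence BLEU", round(sb.score, 2))

    # chrF scores character n-grams, so word-level 4-grams play no role.
    cf = sacrebleu.sentence_chrf(hyp, refs)
    print("sentence chrF", round(cf.score, 2))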
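
    Finally, the effective-order variant from the last point, again only as a sketch, assuming your corpus_bleu accepts the use_effective_order keyword argument:

    import sacrebleu

    hyps = ["This is cat"]
    refs = [["This is a cat"], ["This is a bad cat"]]

    # Only the orders a 3-token hypothesis can contain (1- to 3-grams) enter
    # the geometric mean, so the score is no longer zero, but it is also no
    # longer the standard 4-gram BLEU.
    b = sacrebleu.corpus_bleu(hyps, refs, use_effective_order=True)
    print("effective-order BLEU", round(b.score, 2))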