I'm curious if anyone is familiar with the difference between using NLTK's BLEU score calculation and the SacreBLEU library.
In particular, I'm using both libraries' sentence-level BLEU scores, averaged over the entire dataset, and the two give different results:
>>> from nltk.translate import bleu_score
>>> import sacrebleu
>>> from sacrebleu import sentence_bleu
>>> print(len(predictions))
256
>>> print(len(targets))
256
>>> prediction = "this is the first: the world's the world's the world's the \
... world's the world's the world's the world's the world's the world's the world \
... of the world of the world'"
>>> target = "al gore: so the alliance for climate change has launched two campaigns."
>>> print(bleu_score.sentence_bleu([target], prediction))
0.05422283394039736
>>> print(sentence_bleu(prediction, [target]).score)
0.0
>>> print(sacrebleu.corpus_bleu(predictions, [targets]).score)
0.678758518214081
>>> print(bleu_score.corpus_bleu([targets], [predictions]))
0
As you can see, there are a lot of confusing inconsistencies going on. There's no way my BLEU score is really 67.8%, but it also shouldn't be 0% (there are plenty of overlapping n-grams like "the").
I'd appreciate it if anyone could shed some light on this. Thanks.
NLTK and SacreBLEU end up with different tokenization, mostly in how punctuation is handled. NLTK's BLEU expects input that you have already tokenized yourself, whereas SacreBLEU tokenizes internally and replicates the original Perl implementation from 2002. Whatever tokenizer you apply on the NLTK side (e.g., nltk.word_tokenize) is probably more elaborate, but it makes the numbers incomparable with the original implementation.
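For illustration, here is roughly how the two APIs expect their inputs. This is only a sketch: I shortened your prediction string and use a plain whitespace split for NLTK (nltk.word_tokenize would be another option):

from nltk.translate import bleu_score
from sacrebleu import sentence_bleu as sacre_sentence_bleu

# Shortened versions of the sentence pair from the question.
prediction = "this is the first: the world's the world of the world'"
target = "al gore: so the alliance for climate change has launched two campaigns."

# NLTK expects pre-tokenized input: a list of reference token lists plus a
# hypothesis token list. Passing raw strings makes it compute BLEU over characters.
nltk_score = bleu_score.sentence_bleu([target.split()], prediction.split())

# SacreBLEU takes raw strings and tokenizes them internally; the argument
# order is (hypothesis, [references]) and the score is on a 0-100 scale.
sacre_score = sacre_sentence_bleu(prediction, [target]).score

print(nltk_score)   # between 0 and 1
print(sacre_score)  # between 0 and 100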
The corpus BLEU you got from SacreBLEU is not 67.8% but 0.67%: SacreBLEU reports scores already multiplied by 100, unlike NLTK. So I would not say there is a huge difference between the two scores.
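To compare the two at the corpus level on the same scale, divide SacreBLEU's score by 100. A rough sketch, with two made-up sentence pairs standing in for your 256 and whitespace tokenization on the NLTK side:

import sacrebleu
from nltk.translate import bleu_score

# Tiny stand-ins for the 256-sentence lists from the question.
predictions = ["the cat sat on the mat", "there is a dog in the park"]
targets = ["the cat sat on the mat", "a dog is running in the park"]

# SacreBLEU: hypotheses as raw strings plus a list of reference streams;
# .score is on a 0-100 scale, so divide by 100 to match NLTK.
sacre = sacrebleu.corpus_bleu(predictions, [targets]).score / 100

# NLTK: for every hypothesis a list of tokenized references, plus the
# tokenized hypotheses themselves; the result is already between 0 and 1.
nltk = bleu_score.corpus_bleu(
    [[t.split()] for t in targets],
    [p.split() for p in predictions],
)

print(sacre, nltk)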
Sentence-level BLEU can use various smoothing techniques that keep the score at a reasonable value even when the 3-gram or 4-gram precision is zero. However, note that BLEU as a sentence-level metric is very unreliable.
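In NLTK the smoothing is opt-in via SmoothingFunction. A minimal sketch, again with whitespace tokenization and method1 picked arbitrarily (methods 0-7 are available):

from nltk.translate import bleu_score

prediction = "this is the first: the world's the world of the world'".split()
reference = "al gore: so the alliance for climate change has launched two campaigns.".split()

# Without smoothing, a zero 3-gram or 4-gram precision pushes the score to 0.
plain = bleu_score.sentence_bleu([reference], prediction)

# With smoothing (here method1, which adds a small epsilon to zero counts),
# the score stays non-zero even when higher-order n-grams never match.
chencherry = bleu_score.SmoothingFunction()
smoothed = bleu_score.sentence_bleu(
    [reference], prediction, smoothing_function=chencherry.method1
)

print(plain, smoothed)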