
NLTK sentence_bleu method 7 gives scores above 1


When using the NLTK sentence_bleu function in combination with SmoothingFunction method 7, the maximum score is 1.1167470964180197, even though the BLEU score is defined to lie between 0 and 1.

This score shows up for perfect matches with the reference. I'm using method 7 since my sentences do not always have at least 4 tokens; some may be shorter. Using method 5 gives the same result. The other methods do give 1.0 as a perfect score.

It occurs when I use a single reference and candidate, for example:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

cc = SmoothingFunction()
# The references must be a list of tokenized references; the candidate is a token list.
reference = ['overofficious 98461 54363 39016 78223 52180'.split()]
candidate = 'overofficious 98461 54363 39016 78223 52180'.split()
sentence_bleu(reference, candidate, smoothing_function=cc.method7)

This gives the score: 1.1167470964180197
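
As a quick check, one can score the same perfect match under each smoothing method (method6 is skipped here since it takes an extra interpolation parameter); only method5 and method7 come out above 1.0:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

cc = SmoothingFunction()
reference = ['overofficious 98461 54363 39016 78223 52180'.split()]
candidate = 'overofficious 98461 54363 39016 78223 52180'.split()

# Score the same perfect match under every smoothing method.
for name in ['method0', 'method1', 'method2', 'method3', 'method4', 'method5', 'method7']:
    score = sentence_bleu(reference, candidate, smoothing_function=getattr(cc, name))
    print(name, score)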

Am I doing something wrong? Is this expected behavior, or is there a bug in the implementation of the smoothing function?


Solution

  • It looks like this implementation is at least consistent with Chen and Cherry, 2014. They suggested averaging the n-1, n, and n+1 gram match counts. They also defined m0_prime as m1 + 1 (so in our case it will be 2, and that is what breaks our computation).

    I'm using method5 (it is used internally by method7) from the NLTK source.

    from fractions import Fraction
    from nltk.translate.bleu_score import SmoothingFunction

    cc = SmoothingFunction()
    references = ['overofficious 98461 54363 39016 78223 52180'.split()]
    candidate = 'overofficious 98461 54363 39016 78223 52180'.split()
    # For a perfect match, every raw modified precision is 1.
    p_n = [Fraction(1, 1)] * 4
    p_n5 = cc.method5(p_n, references, candidate, len(candidate))
    print(p_n5)
    

    Output:

    [Fraction(4, 3), Fraction(10, 9), Fraction(28, 27), Fraction(82, 81)]
    

    We may compute 4/3 like this: (2 + 1 + 1) / 3, where the 2 is m0_prime = m1 + 1; then 10/9 = (4/3 + 1 + 1) / 3, because each smoothed value feeds into the average for the next order, and so on.
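
    To see where these fractions come from, here is a minimal sketch (not NLTK's own code) that replays the method5 recurrence by hand for this perfect 6-token match, assuming the extra 5-gram precision that method5 appends is also 1 (which it is here), and then takes the uniform-weight geometric mean that sentence_bleu applies:

    from fractions import Fraction
    from math import exp, log

    # Raw modified precisions for a perfect match are all 1.
    p_n = [Fraction(1, 1)] * 4
    p_n_plus1 = p_n + [Fraction(1, 1)]  # extra 5-gram precision, also 1 for this input
    m_prev = p_n[0] + 1                 # m0_prime = m1 + 1 = 2

    smoothed = []
    for i in range(4):
        # Average the previous smoothed value with the raw n- and (n+1)-gram precisions.
        value = (m_prev + p_n[i] + p_n_plus1[i + 1]) / 3
        smoothed.append(value)
        m_prev = value

    print(smoothed)  # [Fraction(4, 3), Fraction(10, 9), Fraction(28, 27), Fraction(82, 81)]

    # The score is the weighted geometric mean of these values; since every
    # factor exceeds 1, the "BLEU" does too.
    print(exp(sum(0.25 * log(p) for p in smoothed)))  # ~1.1167470964180197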