When using the NLTK sentence_bleu function in combination with SmoothingFunction method 7, the maximum score is 1.1167470964180197, even though the BLEU score is defined to be between 0 and 1.
This score shows up for perfect matches with the reference. I'm using method 7 since I do not always have sentences of at least length 4; some may be shorter. Using method 5 gives the same result. The other methods do give 1.0 as a perfect score.
It occurs when I use a single reference and candidate, for example:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

cc = SmoothingFunction()
reference = ['overofficious 98461 54363 39016 78223 52180'.split()]
candidate = 'overofficious 98461 54363 39016 78223 52180'.split()
sentence_bleu(reference, candidate, smoothing_function=cc.method7)
This gives the score: 1.1167470964180197
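For comparison, checking the same pair with a few other smoothing methods (a quick sketch; exact values may depend on the NLTK version) shows that only method 5 and method 7 go above 1.0:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

cc = SmoothingFunction()
reference = ['overofficious 98461 54363 39016 78223 52180'.split()]
candidate = 'overofficious 98461 54363 39016 78223 52180'.split()

# Perfect match: method2 and method4 leave the score at 1.0,
# while method5 and method7 push it to ~1.1167.
for method in (cc.method2, cc.method4, cc.method5, cc.method7):
    print(method.__name__, sentence_bleu(reference, candidate, smoothing_function=method))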
Am I doing something wrong? Is this expected behavior, or is there a bug in the implementation of the smoothing function?
It looks like this implementation is at least consistent with Chen and Cherry, 2014. They suggested averaging the (n-1)-gram, n-gram and (n+1)-gram counts. They also defined m0_prime as m1 + 1 (so in our case it will be 2, and that breaks our computations).
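A minimal sketch of that recurrence (assuming a perfect match, so every modified precision, including the 5-gram one, is 1, and m0_prime = m1 + 1 = 2) reproduces the same values without going through NLTK:

from fractions import Fraction

# Each smoothed value is the average of the previous smoothed value,
# the current precision and the next (raw) precision, seeded with m0_prime = p_1 + 1.
p = [Fraction(1)] * 5        # p_1 .. p_5, all 1 for a perfect match
m_prev = p[0] + 1            # m0_prime = 2
smoothed = []
for i in range(4):           # smooth p_1 .. p_4
    m_prev = (m_prev + p[i] + p[i + 1]) / 3
    smoothed.append(m_prev)
print(smoothed)              # [Fraction(4, 3), Fraction(10, 9), Fraction(28, 27), Fraction(82, 81)]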
I'm using method5 (it's used by method7) from here.
from fractions import Fraction
from nltk.translate.bleu_score import SmoothingFunction

cc = SmoothingFunction()
references = ['overofficious 98461 54363 39016 78223 52180'.split()]
candidate = 'overofficious 98461 54363 39016 78223 52180'.split()
p_n = [Fraction(1, 1)] * 4
p_n5 = cc.method5(p_n, references, candidate, len(candidate))
Output:
[Fraction(4, 3), Fraction(10, 9), Fraction(28, 27), Fraction(82, 81)]
We may compute 4/3 like this: (2 + 1 + 1) / 3; 10/9 = (4/3 + 1 + 1) / 3, and so on.
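Taking the weighted geometric mean of those smoothed fractions (uniform 0.25 weights and a brevity penalty of 1, since candidate and reference have the same length) gives back the score from the question; a quick sketch:

import math

# Smoothed precisions produced by method5 for the perfect match above.
p_n = [4 / 3, 10 / 9, 28 / 27, 82 / 81]

# sentence_bleu combines them as BP * exp(sum(w_i * log(p_i))), with w_i = 0.25
# and BP = 1 here, i.e. the weighted geometric mean of the smoothed precisions.
score = math.exp(sum(0.25 * math.log(p) for p in p_n))
print(score)  # ~1.1167470964180197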