Tags: python, nlp, nltk, bleu

Why am I getting such a low BLEU score?


from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'ae', 'test']]
candidate = ['this', 'is', 'ad', 'test']
score = sentence_bleu(reference, candidate)
print(score)

I am using this code to calculate the BLEU score, and the score I am getting is 1.0547686614863434e-154. I wonder why I am getting such a small value when only one letter is different in the candidate list.

score = sentence_bleu(reference, candidate, weights=[1])

I tried adding weights=[1] as a parameter and it gave me 0.75 as output. I can't understand why I have to add weights to get a reasonable result. Any help would be appreciated.

I thought it might be because the sentence is not long enough, so I added more words:

from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']
score = sentence_bleu(reference, candidate)
print(score)

Now I am getting 0.488923022434901, but I still think that is too low.


Solution

  • By default, sentence_bleu is configured with four weights: 0.25 for unigrams, 0.25 for bigrams, 0.25 for trigrams, and 0.25 for 4-grams. The number of weights determines the n-gram order, so the BLEU score is computed over four levels of n-grams. In your 4-token example, no trigram or 4-gram matches the reference, so those precisions are zero and the geometric mean collapses to almost nothing (the first sketch at the end of this answer breaks this down).

    When you use weights=[1], you only analyze unigrams:

    reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
    candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']
    
    >>> sentence_bleu(reference, candidate)  # default weights, order of ngrams=4
    0.488923022434901
    

    But you can also treat unigrams as more important than bigrams, which in turn are more important than trigrams and 4-grams:

    >>> sentence_bleu(reference, candidate, weights=[0.5, 0.3, 0.1, 0.1])
    0.6511772622175621
    

    You can also use one of the SmoothingFunction methods; their docstrings in the source code explain what each one does. Two sketches follow: one showing why the unsmoothed score collapses, and one applying a smoothing method.
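
    To see why the original 4-token example collapses to roughly 1e-154, you can inspect the modified n-gram precision for each order on its own. This is a minimal sketch using modified_precision from nltk.translate.bleu_score; the fractions in the comments are worked out by hand for this example:

    from nltk.translate.bleu_score import modified_precision

    reference = [['this', 'is', 'ae', 'test']]
    candidate = ['this', 'is', 'ad', 'test']

    # Modified n-gram precision for each order the default weights cover.
    for n in range(1, 5):
        print(n, modified_precision(reference, candidate, n))
    # 1 3/4  -> three of the four unigrams match
    # 2 1/3  -> only ('this', 'is') survives the substituted token
    # 3 0/2  -> neither trigram matches
    # 4 0/1  -> the single 4-gram does not match

    With the 3-gram and 4-gram precisions at zero, the geometric mean would be zero; NLTK instead substitutes an extremely small constant for the zero counts (and emits a warning), which is where the 1.0547686614863434e-154 comes from.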
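
    Once you know the zero counts are the problem, a smoothing function gives a usable score without changing the weights. A minimal sketch, using method1 (which replaces zero n-gram counts with a small epsilon):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [['this', 'is', 'ae', 'test']]
    candidate = ['this', 'is', 'ad', 'test']

    # method1 replaces zero n-gram counts with a small epsilon, so the
    # geometric mean over all four orders no longer vanishes.
    smoothie = SmoothingFunction().method1
    score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
    print(score)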