I'm trying to evaluate Chinese sentence BLEU scores with NLTK's sentence_bleu()
function. The code is as follows:
import nltk
import jieba
from transformers import AutoTokenizer, BertTokenizer, BartForConditionalGeneration
src = '樓上漏水耍花招不處理可以怎麼做'
ref = '上層漏水耍手段不去處理可以怎麼做'
checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)
hypothesis_translations = []
for sentence in [src]:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
    outputs = model.generate(**inputs)
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    hypothesis_translations.append(translated_sentence)
# for Reference tokenization
inputs_ref = tokenizer(ref, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
outputs_ref = model.generate(**inputs_ref)
tokenized_ref = tokenizer.decode(outputs_ref[0], skip_special_tokens=True)
nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations)
print(nltk_bleu)
The output of printing nltk_bleu is 0.
But when I use the corpus_score() method of the SacreBLEU library, it returns normal and expected results:
import evaluate
from sacrebleu.metrics import BLEU
bleu = BLEU()
bleu_score = bleu.corpus_score(references=tokenized_ref, hypotheses=hypothesis_translations)
print(bleu_score)
which returns:
BLEU = 4.79 73.3/3.6/1.9/1.0 (BP = 1.000 ratio = 15.000 hyp_len = 15 ref_len = 1)
How can I make NLTK's sentence_bleu return correct results?
UPDATE: After taking NLTK's smoothing method 3 into account:
from nltk.translate.bleu_score import SmoothingFunction
smooth_fn = SmoothingFunction()
nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations, smoothing_function=smooth_fn.method3)
the value of nltk_bleu is still 0.
The function sentence_bleu expects a list of lists of tokens as the references, and a flat list of tokens as the hypothesis. The input you supplied does not match these expectations: as written, NLTK treats each character of the tokenized_ref string as a separate reference and the single full sentence string inside hypothesis_translations as one token, so nothing matches and the score is 0.
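For illustration, here is a minimal sketch of the expected shapes, using invented English tokens rather than the data from the question:
from nltk.translate.bleu_score import sentence_bleu

references = [['the', 'cat', 'sat', 'on', 'the', 'mat']]  # list of token lists (one or more references)
hypothesis = ['the', 'cat', 'sat', 'on', 'the', 'rug']    # a single flat list of tokens
print(sentence_bleu(references, hypothesis))              # a positive BLEU-4 score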
Once you fix it, you will get:
smooth_fn = SmoothingFunction()
nltk_bleu = nltk.translate.bleu_score.sentence_bleu([tokenized_ref.split(' ')], hypothesis_translations[0].split(' '), smoothing_function=smooth_fn.method3)
print(nltk_bleu)
>>> 0.43560338053780967
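A side note on tokenization: decoding with the Chinese BERT tokenizer typically yields characters separated by spaces, which is why split(' ') above produces character-level tokens. Since the question already imports jieba, a word-level variant is also possible; the following is only a sketch, and its score will differ from the character-level one:
ref_text = tokenized_ref.replace(' ', '')                 # rejoin the decoded characters into a plain string
hyp_text = hypothesis_translations[0].replace(' ', '')
ref_words = list(jieba.cut(ref_text))                     # jieba.cut segments Chinese text into words
hyp_words = list(jieba.cut(hyp_text))
word_bleu = nltk.translate.bleu_score.sentence_bleu([ref_words], hyp_words, smoothing_function=smooth_fn.method3)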
Also, take into account that by default sentence_bleu computes BLEU-4 (n-grams up to length 4), and that different smoothing functions produce noticeably different scores.
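For example, a short sketch of how the weights argument lowers the n-gram order, and how two smoothing methods compare (reusing the variables defined above):
refs = [tokenized_ref.split(' ')]
hyp = hypothesis_translations[0].split(' ')

bleu1 = nltk.translate.bleu_score.sentence_bleu(refs, hyp, weights=(1, 0, 0, 0))      # unigram precision only
bleu2 = nltk.translate.bleu_score.sentence_bleu(refs, hyp, weights=(0.5, 0.5, 0, 0))  # up to bigrams
bleu4_m1 = nltk.translate.bleu_score.sentence_bleu(refs, hyp, smoothing_function=smooth_fn.method1)
bleu4_m3 = nltk.translate.bleu_score.sentence_bleu(refs, hyp, smoothing_function=smooth_fn.method3)
print(bleu1, bleu2, bleu4_m1, bleu4_m3)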