Tags: python, nltk, cjk, bleu

NLTK sentence_bleu() returns 0 while evaluating Chinese sentences


I'm trying to evaluate Chinese sentence BLEU scores with NLTK's sentence_bleu() function. The code is as follows:

import nltk
import jieba

from transformers import AutoTokenizer, BertTokenizer, BartForConditionalGeneration

src = '樓上漏水耍花招不處理可以怎麼做'
ref = '上層漏水耍手段不去處理可以怎麼做'

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

hypothesis_translations = []

for sentence in [src]:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
    outputs = model.generate(**inputs)
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    hypothesis_translations.append(translated_sentence)

# for Reference tokenization
inputs_ref = tokenizer(ref, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
outputs_ref = model.generate(**inputs_ref)
tokenized_ref = tokenizer.decode(outputs_ref[0], skip_special_tokens=True)

nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations)
print(nltk_bleu)

The output of printing nltk_bleu is 0.

But when I use the corpus_score() method of the SacreBLEU library, it returns a normal, expected result:

import evaluate
from sacrebleu.metrics import BLEU

bleu = BLEU()
bleu_score = bleu.corpus_score(references=tokenized_ref, hypotheses=hypothesis_translations)
print(bleu_score)

which returns:

BLEU = 4.79 73.3/3.6/1.9/1.0 (BP = 1.000 ratio = 15.000 hyp_len = 15 ref_len = 1)

How can I make NLTK's sentence_bleu() return correct results?


UPDATE: After taking NLTK's smoothing method 3 into consideration:

from nltk.translate.bleu_score import SmoothingFunction
smooth_fn = SmoothingFunction()
nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations, smoothing_function=smooth_fn.method3)

the value of nltk_bleu is still 0.


Solution

  • The function sentence_bleu() expects a list of lists of tokens as the reference, and a list of tokens as the hypothesis. The input you supplied does not match those expectations: tokenized_ref is a single string, and hypothesis_translations is a list containing one whole-sentence string.

    Once you fix it, you will get:

    smooth_fn = SmoothingFunction()
    nltk_bleu = nltk.translate.bleu_score.sentence_bleu(
        [tokenized_ref.split(' ')],
        hypothesis_translations[0].split(' '),
        smoothing_function=smooth_fn.method3,
    )
    print(nltk_bleu)
    
    >>>
    0.43560338053780967
    

    Also, keep in mind that by default sentence_bleu() computes BLEU-4 (uniform weights over 1- to 4-grams), and that the different smoothing functions can produce noticeably different scores.
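To see why the shape of the arguments matters, here is a minimal sketch with a made-up space-separated string (standing in for tokenizer.decode() output, which inserts spaces between Chinese characters). When a raw string is passed as the reference, NLTK iterates it character by character, and a one-string hypothesis list is treated as a single giant "token", so n-gram overlap collapses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = '上 層 漏 水'    # stand-in for a decoded, space-separated sentence
hyp = ['上 層 漏 水']  # a list holding one whole-sentence string

smooth = SmoothingFunction()

# Wrong shapes, as in the question: string reference, one-string hypothesis.
# The whole sentence is treated as one token, so the score is near zero.
wrong = sentence_bleu(ref, hyp, smoothing_function=smooth.method3)

# Correct shapes: list of token lists, list of tokens.
# Here hypothesis and reference are identical, so BLEU is 1.0.
right = sentence_bleu([ref.split(' ')], hyp[0].split(' '),
                      smoothing_function=smooth.method3)
print(wrong, right)
```

The only difference between the two calls is how the same strings are wrapped and split, which is exactly the fix above.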
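As a sketch of that last point, the weights parameter controls the n-gram order. The token lists below are hypothetical character-level tokenizations chosen for illustration; for a short sentence, unigram-only BLEU-1 is often more informative than the default BLEU-4:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical character tokens: reference and hypothesis differ
# in the first two characters only.
reference = [['樓', '上', '漏', '水', '可', '以', '怎', '麼', '做']]
hypothesis = ['上', '層', '漏', '水', '可', '以', '怎', '麼', '做']

smooth = SmoothingFunction()

# Default BLEU-4: geometric mean of 1- to 4-gram precisions.
bleu4 = sentence_bleu(reference, hypothesis,
                      smoothing_function=smooth.method3)

# BLEU-1: unigram precision only (8 of 9 tokens match, so 8/9).
bleu1 = sentence_bleu(reference, hypothesis, weights=(1, 0, 0, 0))

print(bleu4, bleu1)
```

Because higher-order n-grams are penalized by the two mismatched characters, bleu4 comes out lower than bleu1 on the same sentence pair.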