Compute corpus-level BLEU score for translations in Python via SacreBLEU

I have a parallel corpus of more than 100K sentence pairs. Some samples:

[
  ["How are you doing today", "comment allez-vous aujourd'hui"], 
  ["Look out! He is a thief", "Chercher! C'est un voleur"], 
  ...(and a lot more pairs of English-French translations)
]

The sample code from the evaluate Python library is as follows:

import evaluate
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions,  references=references)
print(results["score"])

which prints 100.0000004 (that is, 100 up to floating-point rounding), since each prediction exactly matches one of its references.

I would like to obtain the corpus-level BLEU score for the parallel dataset above in order to assess the quality of the translations. How can I adapt this code to my dataset? Thanks.

Solution

The problem you are describing can be reproduced on a single sentence pair:

import sacrebleu
sacrebleu.raw_corpus_bleu("hit the ceiling", "hit the roof", 0.0).score / 100

which returns 1.0000000000000004.

If we wrap the reference in a list:

sacrebleu.raw_corpus_bleu("hit the ceiling", ["hit the roof"], 0.0).score / 100

we now get 0.5999999999999999.

I am not familiar with the function you used, but the same approach, scoring the hypothesis against each reference in turn, also lets you find the reference with the highest score.
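
For the corpus-level score asked about in the question, a minimal sketch with the evaluate wrapper could look like the following. It rests on one assumption that the question does not confirm: that each pair holds a system translation and a human reference translation in the same language. BLEU compares a translation against reference translations, so a source sentence and its translation alone are not enough. The helper names to_sacrebleu_inputs and corpus_bleu_score and the sample pairs are hypothetical, made up for illustration.

```python
from typing import List, Sequence, Tuple


def to_sacrebleu_inputs(
    pairs: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[str]]]:
    """Split (hypothesis, reference) pairs into two aligned lists.

    ASSUMPTION: pair[0] is the system translation and pair[1] is a human
    reference translation in the same language.
    """
    predictions = [hyp for hyp, _ in pairs]
    # The metric expects a list of references per prediction, so the
    # single reference of each pair is wrapped in its own list.
    references = [[ref] for _, ref in pairs]
    return predictions, references


def corpus_bleu_score(pairs: Sequence[Sequence[str]]) -> float:
    """Return one BLEU score for the whole list of pairs."""
    # Imported here so the reshaping helper above works without the library.
    import evaluate

    predictions, references = to_sacrebleu_inputs(pairs)
    sacrebleu = evaluate.load("sacrebleu")
    results = sacrebleu.compute(predictions=predictions, references=references)
    return results["score"]


# Hypothetical sample data: (system output, human reference), both French.
sample_pairs = [
    ["comment allez-vous aujourd'hui", "comment vas-tu aujourd'hui"],
    ["attention ! c'est un voleur", "attention ! c'est un voleur"],
]
```

Calling corpus_bleu_score(sample_pairs), or the same function on the full 100K list, returns a single score for the whole corpus rather than one score per sentence, because all predictions are passed to the metric in one call.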