I have a parallel corpus of more than 100K sentence pairs. Samples:
[
["How are you doing today", "comment allez-vous aujourd'hui"],
["Look out! He is a thief", "Chercher! C'est un voleur"],
...(and a lot more pairs of English-French translations)
]
The sample code from the evaluate Python library is as follows:
import evaluate
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results["score"])
which will print 100.0000004, since each prediction exactly matches one of its references.
I would like to obtain the corpus-level BLEU score of the parallel dataset above, in order to gauge the quality of the translations. How can I adapt the code to work with my dataset? Thanks.
The problem you are describing can be reproduced on a single sentence pair:
import sacrebleu
sacrebleu.raw_corpus_bleu("hit the ceiling", "hit the roof", 0.0).score / 100
which returns 1.0000000000000004.
If we wrap the reference in a list:
sacrebleu.raw_corpus_bleu("hit the ceiling", ["hit the roof"], 0.0).score / 100
we now get 0.5999999999999999.
I am not familiar with the function you used, but you can find the reference with the highest score this way too.
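For the parallel corpus in the question, here is a minimal sketch of how the sample code might be adapted. BLEU compares candidate translations against reference translations, so the English-French pairs alone are not enough. This sketch assumes the French side serves as the reference and that you also have one system translation per English sentence. The names `to_bleu_inputs`, `corpus_bleu_score` and `system_outputs` are hypothetical, and the system outputs shown are made up for illustration.

```python
def to_bleu_inputs(pairs, system_outputs):
    """Reshape [source, reference] pairs plus system translations into the
    predictions/references layout used by the sample code in the question."""
    if len(pairs) != len(system_outputs):
        raise ValueError("need exactly one system output per sentence pair")
    predictions = list(system_outputs)
    # One inner list per sentence; each sentence has a single reference here.
    references = [[target] for _source, target in pairs]
    return predictions, references


def corpus_bleu_score(pairs, system_outputs):
    """Corpus-level BLEU via the same evaluate calls as in the question."""
    # Imported lazily so the reshaping helper works without the library.
    # Requires the `evaluate` and `sacrebleu` packages to be installed.
    import evaluate

    predictions, references = to_bleu_inputs(pairs, system_outputs)
    metric = evaluate.load("sacrebleu")
    results = metric.compute(predictions=predictions, references=references)
    return results["score"]


# The two sample pairs from the question.
pairs = [
    ["How are you doing today", "comment allez-vous aujourd'hui"],
    ["Look out! He is a thief", "Chercher! C'est un voleur"],
]
# Assumption: illustrative outputs of the system being evaluated (made up).
system_outputs = [
    "comment allez-vous aujourd'hui",
    "Attention ! C'est un voleur",
]

predictions, references = to_bleu_inputs(pairs, system_outputs)
# score = corpus_bleu_score(pairs, system_outputs)
```

The score is computed over the whole corpus in one call rather than averaged over per-sentence scores, which is what "corpus-level" means for this metric. If instead the French side is the translation whose quality you want to measure, it would go in as the predictions, and you would need separate trusted reference translations.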