machine-translation seq2seq bleu neural-mt

Is it okay to compare test BLEU scores between NMT models when using a slightly modified standard test set?

I am using tst2013.en, found here, as my test set to get a test BLEU score to compare against previous models. However, I have to filter out the sentences that are longer than 100 words; otherwise I won't have the resources to run the model.

But with a slightly modified test set, is it acceptable to compare the test BLEU score to other models that use the unmodified test set?

Solution

No. For the scores to be comparable, the target side of the test data must be kept intact. Removing longer sentences would probably give you an unfair boost in the BLEU score, because all systems tend to perform worse on longer sentences.

If your model really cannot handle sentences which are longer than 100 words (maybe you can reduce the batch size?), the correct solution to your problem is:

1. Truncate the source side of the test dataset so that every sentence is at most 100 words long. Do not remove any sentences.
2. Translate the modified source side of the dataset.
3. Evaluate the translations using the unchanged target side of the test data.
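
The truncation step above can be sketched as follows. This is only a minimal illustration, assuming whitespace-separated words and one sentence per line; the function names `truncate_sentence` and `truncate_source` are hypothetical, and the translation and BLEU steps are left to whatever model and scoring tool you already use.

```python
MAX_WORDS = 100  # limit taken from the question; adjust to what the model can handle


def truncate_sentence(sentence: str, max_words: int = MAX_WORDS) -> str:
    """Keep only the first max_words whitespace-separated tokens."""
    tokens = sentence.split()
    return " ".join(tokens[:max_words])


def truncate_source(lines, max_words: int = MAX_WORDS):
    """Truncate every source line without dropping any of them, so the
    output stays aligned line by line with the unchanged reference file."""
    return [truncate_sentence(line, max_words) for line in lines]


if __name__ == "__main__":
    # Toy data standing in for the source side of the test set.
    source = ["short sentence .", " ".join(["word"] * 150)]
    truncated = truncate_source(source)
    assert len(truncated) == len(source)  # no sentence was removed
    print([len(s.split()) for s in truncated])  # prints [3, 100]
```

Because the number of lines is preserved, the translations of the truncated source can be scored directly against the original, unmodified reference file.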