So I am using the BLEU score metric to compare my NMT model's performance with existing models. However, I'm wondering how many settings I have to match with the other models.
Matching settings like dev sets, test sets, and hyperparameters seems doable. However, the preprocessing I use differs from that of existing models, so I'm wondering whether my model's BLEU score can still be compared with theirs. There is also a chance that existing models have hidden parameters that were not reported.
https://arxiv.org/pdf/1804.08771.pdf addresses the problem of reporting BLEU and calls for a switch to SacreBLEU. But many existing models report plain BLEU, so I don't think I can use the SacreBLEU metric for my model.
SacreBLEU is not a different metric; it is an implementation of BLEU. What you see reported in papers as BLEU should therefore be comparable with what you get from SacreBLEU. Use SacreBLEU whenever you can.
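For example, computing a corpus-level BLEU score on plain, detokenized text takes only a few lines with the sacrebleu Python package (a minimal sketch; the toy sentences are made up):

```python
import sacrebleu

# System outputs: one detokenized hypothesis per sentence.
hyps = ["The dog bit the man.", "It was not surprising."]

# References: a list of reference streams; here one reference per sentence.
refs = [["The dog bit the man.", "It was not unexpected."]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)  # corpus-level BLEU score as a float
```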
The BLEU score is very sensitive to tokenization, so it is important that everyone uses the same one. Originally, there was a Perl script from 2001 that was considered the canonical implementation of BLEU for a long time. Using that script involves several hassles (it is written in Perl and requires the data to be in a rather obscure SGM format). Because of that (and because BLEU is fairly simple to compute), many independent implementations appeared, e.g., in MultEval and NLTK. They are easier to use, but due to subtle differences in data preprocessing they do not yield the same results. SacreBLEU applies the same tokenization and produces the same scores as the original Perl script, but it reads plain-text data and is written in Python, which is currently the most widely used language in machine translation.
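If you want the tokenization and other settings to be explicit and reportable, you can configure the metric object and ask it for its signature (a sketch; exact option names may vary slightly between sacrebleu versions):

```python
from sacrebleu.metrics import BLEU

hyps = ["The dog bit the man."]
refs = [["The dog bit the man."]]

# Pin the tokenizer explicitly instead of relying on defaults.
bleu = BLEU(tokenize="13a")

print(bleu.corpus_score(hyps, refs))  # formatted BLEU score
print(bleu.get_signature())           # tokenizer, smoothing, case handling, version
```

Reporting that signature string alongside your score is the whole point of SacreBLEU: it lets others reproduce exactly how the number was computed, which addresses the hidden-preprocessing problem you mention in the question.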