I am using tst2013.en, found here, as my test set to compute a test BLEU score that I can compare against previous models. However, I have to filter out sentences longer than 100 words; otherwise I won't have the resources to run the model.
With a slightly modified test set, is it acceptable to compare my test BLEU score to the scores of models evaluated on the unmodified test set?
No. For the scores to be comparable, the target side of the test data must stay intact. Removing the longer sentences would probably give you an unfair boost in BLEU, because translation systems generally perform worse on longer sentences.
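To check how much a length filter would actually change the test set, a minimal sketch along these lines may help. It assumes whitespace tokenization and line-aligned source and reference lists; the function name `split_by_length` and the toy sentences are hypothetical and only stand in for the real files.

```python
def split_by_length(sources, references, max_words=100):
    """Partition aligned (source, reference) pairs by source length.

    Length is counted in whitespace-separated tokens (an assumption;
    your own preprocessing may count words differently).
    """
    if len(sources) != len(references):
        raise ValueError("source and reference must have the same number of lines")
    kept, dropped = [], []
    for src, ref in zip(sources, references):
        if len(src.split()) > max_words:
            dropped.append((src, ref))
        else:
            kept.append((src, ref))
    return kept, dropped


# Toy data standing in for the real test files (hypothetical contents).
sources = ["a short sentence", "word " * 150, "another short one"]
references = ["ein kurzer Satz", "Wort " * 150, "noch ein kurzer"]

kept, dropped = split_by_length(sources, references)
print(f"kept {len(kept)} pairs, dropped {len(dropped)} pairs")
```

Every pair that lands in `dropped` is a reference sentence that other systems were scored on and yours would not be, which is why a score on the filtered set is not directly comparable.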
If your model really cannot handle sentences which are longer than 100 words (maybe you can reduce the batch size?), the correct solution to your problem is: