When using word alignment tools like fast_align, do more sentences mean better accuracy?

I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.

Would throwing more sentences into the process help fast_align be more accurate? Say I take some OPUS data with 100k aligned sentence pairs, append my 1000 sentences at the end of it, and feed the whole thing to fast_align. Will that help? I can't seem to find any info on whether this would make sense.

Solution

[Disclaimer: I know next to nothing about alignment and have not used fast_align.]

Yes.

You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset and trying it at an even lower scale.
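One way to set up that experiment is sketched below. It is only an illustration: the helper names `nested_subsets` and `alignment_f1` are made up here, and it assumes you have hand-checked gold alignment links for a few sentences to score against, which the question does not mention. Subsets are nested (each smaller set is a prefix of the larger one) so that the curve reflects size rather than which sentences happened to be drawn.

```python
import random


def nested_subsets(pairs, sizes, seed=0):
    """Shuffle once, then take prefixes so smaller sets are contained in larger ones."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Sizes larger than the corpus are skipped rather than silently truncated.
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def alignment_f1(predicted, gold):
    """F1 over alignment links, each link being a (source_index, target_index) tuple."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Run the aligner once per subset, score the same held-out sentences each time, and plot the score against subset size.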

That said, 1000 is already absurdly low. For these purposes 1000 ≈ 0, and I would not expect it to work.

Better would be to try 10K, 100K and 1M sentence pairs. For results more comparable to others', use a standard corpus, e.g. Wikipedia or data from the research workshops.

Adding data that is very different from the data you care about can have mixed results, but in this case more data can hardly hurt. We could be more helpful with suggestions if you mention a specific domain, dataset or goal.
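For the "append my 1000 sentences to a bigger corpus" idea from the question, here is a rough sketch of the bookkeeping. It assumes what the fast_align README describes: one tokenized sentence pair per input line in the form `source ||| target`, and one output line per input line in `i-j` (Pharaoh) format, in the same order. The function names are hypothetical, and I have not run this against fast_align myself.

```python
def to_fast_align_line(source, target):
    """Format one tokenized sentence pair the way fast_align expects it."""
    return f"{source.strip()} ||| {target.strip()}"


def build_combined(extra_pairs, own_pairs):
    """Put the extra corpus first and your own pairs last, so they are easy to find again."""
    return [to_fast_align_line(s, t) for s, t in list(extra_pairs) + list(own_pairs)]


def own_alignments(alignment_lines, n_own):
    """Keep only the output lines that belong to your own (last n_own) pairs."""
    return alignment_lines[-n_own:] if n_own else []


def parse_pharaoh(line):
    """Turn '0-0 1-2' into [(0, 0), (1, 2)]."""
    return [tuple(int(index) for index in link.split("-")) for link in line.split()]


# Assumed workflow (command taken from the fast_align README, paths are placeholders):
#   write build_combined(...) to combined.de-en, one line each, then run
#   ./fast_align -i combined.de-en -d -o -v > forward.align
#   and pass the lines of forward.align to own_alignments(lines, 1000).
```

Keeping your own sentences at the end means you never need to track indices: the last 1000 output lines are yours.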