nlp, alignment, language-translation, machine-translation

How do I interpret the alignment score from the alignment tool fast_align?


I'm using the alignment toolkit fast_align (https://github.com/clab/fast_align) to get word-to-word alignments of a parallel corpus. There is an option to print out the alignment score -- how do I interpret this score? Does it measure the degree of alignment between the parallel sentences? I know that some of the sentences in the corpus are well aligned and others are not, but so far I see no correlation between the score and how well aligned they are. Should I adjust for the number of words in the sentence?


Solution

  • fast_align is an implementation of IBM Model 2 (more precisely, a reparameterization of it); the score is the probability estimated by this model. The details of the model are very nicely explained in these slides from JHU.

    The score is the probability of the source sentence given the target sentence words and the alignment. The algorithm iteratively estimates:

    1. Word-to-word translation probabilities for (virtually all) pairs of source-language and target-language words.
    2. The optimal alignment given the current word-to-word translation probabilities (a toy version of this loop is sketched below).
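
    For intuition, here is a minimal sketch of that EM loop using a simplified IBM Model 1 (it drops the positional component that Model 2 and fast_align add on top); the toy corpus and variable names are made up for illustration.

    ```python
    from collections import defaultdict

    # Toy parallel corpus (hypothetical data, for illustration only).
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]

    # Uniform initialization of translation probabilities t(src | tgt).
    src_vocab = {w for src, _ in corpus for w in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))

    for _ in range(10):  # EM iterations
        counts = defaultdict(float)
        totals = defaultdict(float)
        # E-step: expected co-occurrence counts under the current t.
        for src, tgt in corpus:
            for s in src:
                norm = sum(t[(s, g)] for g in tgt)
                for g in tgt:
                    frac = t[(s, g)] / norm
                    counts[(s, g)] += frac
                    totals[g] += frac
        # M-step: re-estimate t from the expected counts.
        for (s, g), c in counts.items():
            t[(s, g)] = c / totals[g]

    # The (Viterbi) alignment links each source word to its most
    # probable target word under the converged t.
    for src, tgt in corpus:
        print([(s, max(tgt, key=lambda g: t[(s, g)])) for s in src])
    ```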

    The score is then the product of the word-to-word translation probabilities under the alignment the algorithm converged to. So, in theory, it should correlate with how parallel the sentences are, but there are many ways in which this can break. For instance, rare words have unreliable probability estimates. Another problem is that some words (such as "of") can be part of multi-word expressions that correspond to a single word in the other language, which skews the probability estimates as well. So it is no wonder that the probability cannot always be trusted. Also, because the score is a product over aligned words, longer sentences necessarily get lower scores, so if you compare sentences of different lengths you should normalize, e.g., by taking the average per-word log probability.
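
    To make the length issue concrete, here is a hypothetical sketch of how such a score decomposes and how it could be length-normalized; the probability table and alignment below are made-up inputs, not fast_align's actual output format.

    ```python
    import math

    def sentence_score(src, tgt, alignment, t):
        """Log of the product of word-to-word translation probabilities
        t[(src_word, tgt_word)] under a fixed alignment of (i, j) index pairs."""
        log_p = sum(math.log(t[(src[i], tgt[j])]) for i, j in alignment)
        return log_p, log_p / len(alignment)  # raw and per-link (length-normalized)

    # Hypothetical translation probabilities and alignment for illustration.
    t = {("das", "the"): 0.8, ("haus", "house"): 0.7}
    raw, per_word = sentence_score(["das", "haus"], ["the", "house"],
                                   [(0, 0), (1, 1)], t)
    print(raw, per_word)
    ```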

    If your goal is to filter the parallel corpus and remove incorrectly aligned sentence pairs, I would recommend something else. You can, e.g., use Multilingual BERT as they did in a paper by Google, where they used the centered vectors for cross-lingual retrieval. Or just google "parallel corpus filtering."
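
    As an illustration of that direction, below is a rough sketch that scores sentence pairs with mean-pooled multilingual BERT embeddings, centers the vectors per language, and keeps pairs above a cosine-similarity threshold; the model name, pooling, and threshold are choices I'm assuming here, not a prescription from the paper.

    ```python
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    def embed(sentences):
        """Mean-pool the last hidden states over non-padding tokens."""
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

    def pair_scores(src_sents, tgt_sents):
        src, tgt = embed(src_sents), embed(tgt_sents)
        # Center each language's vectors to reduce the language-identity
        # signal (in practice, compute the means over a large sample).
        src = src - src.mean(0, keepdim=True)
        tgt = tgt - tgt.mean(0, keepdim=True)
        return torch.nn.functional.cosine_similarity(src, tgt, dim=1)

    # Keep only pairs whose similarity clears a (tunable) threshold.
    src = ["Das ist ein Haus.", "Völlig unzusammenhängender Satz."]
    tgt = ["This is a house.", "The weather report for tomorrow."]
    scores = pair_scores(src, tgt)
    keep = [pair for pair, sc in zip(zip(src, tgt), scores) if sc > 0.5]
    print(scores, keep)
    ```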