algorithm nlp machine-translation moses

How to get phrase tables from word alignments?


My word alignment output file looks like this:

I wish to say with regard to the initiative of the Portuguese Presidency that we support the spirit and the political intention behind it . In bezug auf die Initiative der portugiesischen Präsidentschaft möchte ich zum Ausdruck bringen , daß wir den Geist und die politische Absicht , die dahinter stehen , unterstützen .   0-0 5-1 5-2 2-3 8-4 7-5 11-6 12-7 1-8 0-9 9-10 3-11 10-12 13-13 13-14 14-15 16-16 17-17 18-18 16-19 20-20 21-21 19-22 19-23 22-24 22-25 23-26 15-27 24-28
It may not be an ideal initiative in terms of its structure but we accept Mr President-in-Office , that it is rooted in idealism and for that reason we are inclined to support it .    Von der Struktur her ist es vielleicht keine ideale Initiative , aber , Herr amtierender Ratspräsident , wir akzeptieren , daß sie auf Idealismus fußt , und sind deshalb geneigt , sie mitzutragen .   0-0 11-2 8-3 0-4 3-5 1-6 2-7 5-8 6-9 12-11 17-12 15-13 16-14 16-15 17-16 13-17 14-18 17-19 18-20 19-21 21-22 23-23 21-24 26-25 24-26 29-27 27-28 30-29 31-30 33-31 32-32 34-33

How can I produce the phrase tables that are used by MOSES from this output?

These slides (16-21) explain consistent phrase extraction: http://www.inf.ed.ac.uk/teaching/courses/mt/lectures/phrase-model.pdf. But what is the algorithm for actually extracting the phrases?


Solution

  • To build a phrase table, first extract the phrase pairs with the following algorithm from Philipp Koehn's Statistical Machine Translation book, p. 133:

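    In outline: for every span of English words, find the smallest span of foreign words that covers all of its alignment points; discard the pair if any word in that foreign span is aligned to a word outside the English span; otherwise keep the pair, along with the variants obtained by extending the foreign span over adjacent unaligned words.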

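    Then estimate each phrase pair's translation probability by its relative frequency, i.e.

    $$\phi(\bar{f} \mid \bar{e}) = \frac{\operatorname{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}_i} \operatorname{count}(\bar{e}, \bar{f}_i)}$$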

    Note that the original printed version of the book contains an error on line 4 of the extract() function; it is corrected in the errata.

    Also see "Phrase extraction algorithm for statistical machine translation" for the details.
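
The extraction step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the Moses implementation: the names `parse_alignment` and `extract_phrases` and the `max_len` cap are hypothetical, introduced only for illustration. Each alignment point `i-j` is assumed to link word i of the first sentence to word j of the second (0-based), which matches the data in the question.

```python
def parse_alignment(text):
    """Turn a string like '0-0 1-2' into a set of (e_index, f_index) pairs."""
    return {tuple(int(i) for i in link.split("-")) for link in text.split()}


def extract_phrases(e_words, f_words, alignment, max_len=7):
    """Return the set of phrase pairs consistent with the word alignment.

    max_len is an assumed cap on phrase length (in words) on either side.
    """
    f_aligned = {f for _, f in alignment}  # foreign positions with any link
    pairs = set()
    for e_start in range(len(e_words)):
        for e_end in range(e_start, min(e_start + max_len, len(e_words))):
            # Smallest foreign span covering every link from the English span.
            linked = [f for e, f in alignment if e_start <= e <= e_end]
            if not linked:
                continue  # a phrase pair needs at least one alignment point
            f_start, f_end = min(linked), max(linked)
            # Consistency check: no word inside the foreign span may be
            # aligned to a word outside the English span.
            if any(f_start <= f <= f_end and not (e_start <= e <= e_end)
                   for e, f in alignment):
                continue
            # Emit the pair, then grow the foreign span over unaligned
            # neighbours on both sides.
            fs = f_start
            while True:
                fe = f_end
                while True:
                    if fe - fs < max_len:
                        pairs.add((" ".join(e_words[e_start:e_end + 1]),
                                   " ".join(f_words[fs:fe + 1])))
                    fe += 1
                    if fe >= len(f_words) or fe in f_aligned:
                        break
                fs -= 1
                if fs < 0 or fs in f_aligned:
                    break
    return pairs


# Toy example: "y" is unaligned, so it can attach to either neighbour.
e = "a b".split()
f = "x y z".split()
for pair in sorted(extract_phrases(e, f, parse_alignment("0-0 1-2"))):
    print(pair)
```

For the file in the question, each line would first be split into its three fields (first sentence, second sentence, alignment string) before calling `extract_phrases`. The exact field separator in that file is not recoverable from the post, so that step is left out here.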
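
The relative-frequency estimate can be sketched in the same way. The function name `relative_frequencies` and the toy phrase pairs below are hypothetical. The input is assumed to be a flat list of (English phrase, foreign phrase) pairs collected over the whole corpus, with repeats kept so that they can be counted. Moses phrase tables typically carry further scores (for example the inverse direction and lexical weights) that this sketch does not compute.

```python
from collections import Counter


def relative_frequencies(phrase_pairs):
    """Estimate phi(f | e) = count(e, f) / sum over f' of count(e, f')."""
    pair_counts = Counter(phrase_pairs)
    # Total number of extracted pairs for each English phrase.
    e_totals = Counter()
    for (e, _f), c in pair_counts.items():
        e_totals[e] += c
    return {(e, f): c / e_totals[e] for (e, f), c in pair_counts.items()}


# Toy corpus-level list of extracted pairs (made up for illustration).
toy_pairs = [
    ("house", "Haus"),
    ("house", "Haus"),
    ("house", "Heim"),
    ("the", "das"),
]
phi = relative_frequencies(toy_pairs)
for (e, f), p in sorted(phi.items()):
    print(f"{e} ||| {f} ||| {p:.3f}")
```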