I am using Python 2.7 and Django 1.7.
When I use Google Translator Toolkit to machine-translate my .po files into another language (English to German), there are many errors due to the different Django template variables in my translation tags.
I understand that machine translation is not so great, but I only want to test my translation strings on my test pages.
Here is an example of a typical error in a .po file machine-translated from English (en) to German (de).
#. Translators: {{ site_name_lowercase }} is a variable that does not require translation.
#: .\templates\users\reset_password_email_html.txt:47
#: .\templates\users\reset_password_email_txt.txt:18
#, python-format
msgid ""
"Once you've returned to %(site_name_lowercase)s.com, we will give you "
"instructions to reset your password."
msgstr "Sobald du mit% (site_name_lowercase) s.com zurückgegeben haben, geben wir Ihnen Anweisungen, um Ihr Passwort zurückzusetzen."
The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
I have hundreds of these types of errors, and I estimate that a manual find & replace would take at least 7 hours. Plus, if I run makemessages and then translate the .po file again, I would have to go through the whole find and replace again.
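In principle the inserted spaces could be collapsed with a small script, along the lines of the rough sketch below (it only fixes the spacing damage; it cannot restore a space the variable lost to the preceding word), but I would still have to re-run it after every makemessages / re-translation cycle.

import re

# The Toolkit turns "%(site_name_lowercase)s" into "% (site_name_lowercase) s".
# This collapses the inserted spaces back out; it assumes the damage is
# whitespace only.
BROKEN_VAR = re.compile(r'%\s*\(\s*(\w+)\s*\)\s*s')

def repair(text):
    return BROKEN_VAR.sub(r'%(\1)s', text)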
I am hoping that there is some kind of undocumented rule in the Google Translator Toolkit that will make the machine translation ignore the variables. I have read the Google Translator Toolkit docs and searched SO and Google, but did not find anything that would assist me.
Does anyone have any suggestions?
The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
This is caused by tokenization prior to translation, followed by detokenization after translation: Google Translate splits the input into tokens before translating and re-merges them afterwards. The variables you use are composed of exactly the kind of characters tokenizers use to detect token boundaries.

To avoid this sort of problem, you can pre-process your file and replace the offending variables with placeholders that do not have this issue. I suggest you try out a couple of things, e.g. _VAR_PLACE_HOLDER_; it is important that the placeholder contains no punctuation characters that may cause the tokenizer to split. After pre-processing, translate the newly generated file, then post-process by replacing the placeholders with their original values. Typically, the placeholder will be picked up as an Out-Of-Vocabulary (OOV) item and preserved during translation. Experiment with including a sequence number (to keep track of your placeholders during post-processing), since word reordering may occur.

There used to be a scientific API for Google Translate that gives you the token alignments; you could use these for post-processing as well.
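Here is a minimal pre/post-processing sketch in Python (the XVAR0X token shape is just one guess; experiment with a few variants, as noted above):

import re

# python-format variables such as %(site_name_lowercase)s
VAR_RE = re.compile(r'%\(\w+\)s')

def preprocess(text):
    """Swap each variable for a numbered alphanumeric token so the
    tokenizer treats it as a single OOV item."""
    mapping = {}
    def repl(match):
        token = 'XVAR%dX' % len(mapping)
        mapping[token] = match.group(0)
        return token
    return VAR_RE.sub(repl, text), mapping

def postprocess(translated, mapping):
    """Put the original variables back after translation."""
    for token, original in mapping.items():
        translated = translated.replace(token, original)
    return translated

safe, mapping = preprocess("Once you've returned to %(site_name_lowercase)s.com, ...")
# safe == "Once you've returned to XVAR0X.com, ..."
# Translate `safe` with the Toolkit, then restore:
# postprocess(translated_text, mapping)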
Note that this procedure will not give you the best translation output possible, as the language model will not recognize the placeholder. For example, in one translation with a placeholder, the token "gelezen" ended up in the wrong place.
If you just want to test the system for your variables, and you do not care about the translation quality, this is the fastest way to go.
Should you decide to go for a better solution, you can solve this issue yourself by building your own machine translation system (it's fun, by the way; see http://www.statmt.org/moses/) and applying the procedure explained above, but with, for example, part-of-speech tags to improve the language model. Note that you can use the alignment information as well.