google-translate

Getting weird markup from Google Translate like ~~POS=TRUNC


I'm suddenly getting some strange markup when translating phrases with the Google Translate API via the Java library. Examples for English → Swedish include:

Vector graphics → vektor~~POS=TRUNC grafikk~~POS=HEADCOMP

Javascript → Javascript script~~POS=HEADCOMP

It looks like it's related to compound noun handling. Is this a feature of the API that I can deactivate somehow or is this a new bug on the server side?


Solution

  • This looks like a bug in the server-side translator. I also get it on the web site: https://translate.google.com/#view=home&op=translate&sl=ru&tl=no&text=%D0%9E%D0%B1%D1%89%D0%B5%D0%B6%D0%B8%D1%82%D0%B8%D0%B5 gives me vandrer~~POS=TRUNC.

    In NLP, "POS" means part of speech, and "HEADCOMP" sounds like it could mark the head of a noun compound. I'm guessing they TRUNCate the non-head parts of compounds (which are practically never inflected). So Google Translate is leaking some of its internals. What's surprising is that such tags are a staple of rule-based/knowledge-based systems, whereas Google typically uses only pure machine-learning methods and shuns hard-coded knowledge. One possibility is that they used a noun-compound analyser to expand their training set and then ran ML on it (similar to how Systran & Koehn trained statistical MT on a parallel corpus translated with a rule-based MT system), but had a bug in the script that was meant to clean up the tags before training.

    It'd be fun to find out which system they used, in case it was an open-source one. Unfortunately the tags are practically ungoogleable, since the web is now littered with spammy machine-translated (and non-post-edited) pages full of those tags.
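
    If this is indeed a server-side bug, there is probably nothing to switch off in the API, but the tags can be stripped on the client after translation. Below is a minimal sketch in plain Java, assuming every leaked tag has the form `~~POS=` followed by uppercase letters, as in the two tags reported above (TRUNC, HEADCOMP). The class and method names (`PosTagStripper`, `stripPosTags`) are hypothetical and not part of any Google library.

```java
import java.util.regex.Pattern;

public class PosTagStripper {
    // Assumption: leaked tags look like "~~POS=" plus uppercase letters,
    // e.g. "~~POS=TRUNC" or "~~POS=HEADCOMP". Other tag shapes are not handled.
    private static final Pattern POS_TAG = Pattern.compile("~~POS=[A-Z]+");

    /** Removes the leaked tags and leaves the rest of the text untouched. */
    public static String stripPosTags(String translated) {
        return POS_TAG.matcher(translated).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample string taken from the question.
        System.out.println(stripPosTags("vektor~~POS=TRUNC grafikk~~POS=HEADCOMP"));
    }
}
```

    This only removes the markup. It does not repair the translation itself, so an output like "Javascript script~~POS=HEADCOMP" would still come out as "Javascript script".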