ibm-cloud ibm-watson machine-translation

IBM Watson Language Translation - correct way to train using parallel corpus


I have a set of articles with translations, which I want to use as training data for IBM Watson Language Translator. What is the correct way to use these articles for training? Do I use a whole article and its translation as one entry in the parallel corpus, or do I have to split each article into sentences and add every sentence and its translation as a separate entry?


Solution

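  • You have two choices.

    Either split the text into phrase (or sentence) pairs, with a "from" and a "to" for each pair, and upload them as a TMX file to create either a forced_glossary or a parallel_corpus. A forced_glossary forces specific translations of the listed terms, while a parallel_corpus gives the service additional translation pairs to learn your domain from. Whole articles should not go in as single entries; each entry should be one aligned pair.

    Or send text as a single plain-text file to create a monolingual_corpus. Note that a monolingual corpus contains text in the target language only, so you would send just the translated side of your articles; it is used to improve the fluency of the output and does not use the source/translation pairing.

    Detailed documentation is available at https://www.ibm.com/watson/developercloud/doc/language-translator/customizing.html#training and the API documentation is available at https://www.ibm.com/watson/developercloud/language-translator/api/v2/?curl#create-model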
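
If it helps, here is a minimal sketch of how already-aligned sentence pairs could be written out as a TMX document for a parallel_corpus or forced_glossary. It uses only the Python standard library. The function name `build_tmx`, the header values, and the `en`/`es` language codes are illustrative assumptions, not something the service prescribes; check the customization documentation linked above for the exact TMX requirements. Aligning the sentences of each article with their translations is a separate step that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Return a TMX 1.4 document (as a string) with one <tu> per aligned pair.

    `pairs` is an iterable of (source_segment, target_segment) tuples.
    The header values below are placeholder assumptions for illustration.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        # One translation unit per aligned sentence or phrase pair.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    sample_pairs = [
        ("Hello.", "Hola."),
        ("Thank you.", "Gracias."),
    ]
    # Write the result to a file that can then be uploaded when creating the model.
    with open("parallel_corpus.tmx", "w", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fh.write(build_tmx(sample_pairs))
```

The point of the sketch is the shape of the data: every `<tu>` holds exactly one source segment and its translation, which is why a whole article should be split before it goes into the corpus.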