Search code examples
azuretraining-datamicrosoft-translator

Microsoft translator engine customization: parallel txt files


I am trying to perform some NMT engine customization for Japanese but I am having some difficulties uploading parallel txt files. I've gathered 10k parallel sentences and I've put them into two txt files:

line count using <code>wc -l</code>

As the guide suggested, I've also been careful to remove sentences containing the \n and \r characters in them, but upon uploading I get the following:

upload line count mismatch

What's wrong?


Solution

  • We display the sentence counts because the model training engine operates at the sentence level. The expected format of the txt parallel file set is one sentence for each line. During the upload process we do run a sentence breaker which identifies end of sentence markers and breaks accordingly. This is why the count of sentences do not always match the count of lines. Sentences are the units we operate on, not lines of the input file. That's why we focus on sentences rather than lines.

    This is also why we suggest removing newline characters within sentences. The newline is considered an end of sentence marker, so having newlines within a sentence creates a false sentence break.

    In response to your second concern, we do run a sentence aligning process on most data that is submitted. If there is an inconsistent number of sentences in the uploaded parallel files we can usually get most of the sentence pairs, as long as the sentences are fairly close.