I'm creating a custom model from a training set in Microsoft Translator Text for Japanese (JA) to English (EN) translation. Should the training data be tokenized, and is all lowercase preferable?
In Japanese the quotation characters (「」 and 『』) are different from those in English. In the JA training data, should these be tokenized (separated by a space)? In the parallel EN training data, should the EN quotation marks ("") be used, or the JA quotation marks?
Beyond that, is any other pre-processing desirable, such as transforming the text to all lowercase? The casing of the text returned by the deployed model does not matter.
Leave the training material as you would present it to a human reader, with casing and punctuation intact. Casing and punctuation matter in translation; they are a relevant signal for the engine to receive. There is no reason to apply your own tokenization, since it would interfere with the system's own tokenization. The best training material is sentence- or segment-aligned, as you would get in a TMX or XLIFF exported from a translation memory (TM).
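
As a rough illustration, here is a minimal Python sketch (the file name and segment pairs are hypothetical) of writing sentence-aligned JA/EN pairs into a TMX file with no tokenization, no lowercasing, and the original punctuation on each side left exactly as a reader would see it:

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence-aligned pairs. Casing and punctuation are kept
# as-is: Japanese quotation marks 「」『』 on the JA side, English
# quotation marks "" on the EN side.
pairs = [
    ("彼は「おはよう」と言った。", 'He said, "Good morning."'),
    ("『吾輩は猫である』を読んだ。", "I read I Am a Cat."),
]

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "srclang": "ja",
    "datatype": "plaintext",
    "segtype": "sentence",
    "adminlang": "en-us",
    "o-tmf": "none",
    "creationtool": "example",
    "creationtoolversion": "1.0",
})
body = ET.SubElement(tmx, "body")

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serializes as xml:lang

for ja, en in pairs:
    tu = ET.SubElement(body, "tu")  # one translation unit per aligned segment
    for lang, text in (("ja", ja), ("en", en)):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = text  # segment text, untouched

ET.ElementTree(tmx).write("training.tmx", encoding="utf-8", xml_declaration=True)
```

The point of the sketch is what it does not do: no splitting of 「」 from the surrounding text, no case folding, no substitution of one language's quotation marks for the other's. Each side stays natural for its own language, and the alignment lives in the TMX structure rather than in any pre-processing.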