Tags: algorithm, internationalization, translation, text-processing

Algorithm to estimate number of English translation words from Japanese source


I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).

Examples:

  • computer: コンピュータ (Katakana: 6 characters); 計算機 (Kanji: 3 characters)
  • whale: くじら (Hiragana: 3 characters); 鯨 (Kanji: 1 character)

As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.


Solution

  • I would start with a linear approximation: approx_english_words = a1*n_chars_script1 + a2*n_chars_script2 + a3*n_chars_script3, with the coefficients a1, a2, a3 fit from your data using linear least squares.

    If this doesn't approximate well, look at the worst-fitting cases to see why they miss (specialized vocabulary, loanwords, etc.).
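A minimal sketch of that fitting step in Python, using NumPy's least-squares solver. The Unicode ranges, function names, and toy data below are my own assumptions for illustration (only the basic Hiragana, Katakana, and CJK Unified Ideograph blocks are matched; half-width Katakana and rare Kanji extensions are ignored):

```python
import re
import numpy as np

# Assumed Unicode ranges for the three scripts (basic blocks only).
SCRIPTS = {
    "kanji": re.compile(r"[\u4e00-\u9fff]"),      # CJK Unified Ideographs
    "hiragana": re.compile(r"[\u3040-\u309f]"),
    "katakana": re.compile(r"[\u30a0-\u30ff]"),   # includes the long-vowel mark ー
}

def script_counts(text):
    """Count Kanji, Hiragana, and Katakana characters in a Japanese text."""
    return [len(pattern.findall(text)) for pattern in SCRIPTS.values()]

def fit_coefficients(pairs):
    """Fit a1, a2, a3 by linear least squares.

    `pairs` is a list of (japanese_text, english_word_count) tuples
    drawn from the matched corpus.
    """
    X = np.array([script_counts(ja) for ja, _ in pairs], dtype=float)
    y = np.array([n for _, n in pairs], dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def estimate_english_words(text, coeffs):
    """Estimate the English word count for a new Japanese source text."""
    return float(np.dot(script_counts(text), coeffs))
```

With a real corpus you would build `pairs` from the aligned documents; inspecting the residuals of `np.linalg.lstsq` then points you at the worst-fitting documents to examine.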