Tags: text, lucene, nlp, semantics

Why do I need a tokenizer for each language?


When processing text, why would one need a tokenizer specialized for the language?

Wouldn't tokenizing by whitespace be enough? In which cases is simple whitespace tokenization not a good idea?


Solution

  • Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.

    Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。

    English: If you only have time for one club in Singapore, then it simply has to be Zouk.

    Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.

    Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。

    Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.

    Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.

    Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf

    The tokenized version of the parallel text above should look like this:

    [image: tokenized versions of the parallel sentences above]

    For English, this is simple because each LMU is delimited by whitespace. In other languages, however, that might not be the case. Most languages written in a Latin-based script, such as Indonesian, use the same whitespace delimiters, so LMUs can be identified just as easily.
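
    As a quick, minimal sketch (plain Python, nothing beyond str.split), whitespace tokenization already recovers the LMUs of the English and Indonesian sentences above; only the punctuation still clings to the neighbouring word:

        # Naive whitespace tokenizer: adequate for English and Indonesian,
        # where LMUs are (mostly) separated by spaces.
        english = ("If you only have time for one club in Singapore, "
                   "then it simply has to be Zouk.")
        indonesian = ("Jika Anda hanya memiliki waktu untuk satu klub "
                      "di Singapura, pergilah ke Zouk.")

        for sentence in (english, indonesian):
            print(sentence.split())
        # Every word comes out as its own token; only trailing punctuation
        # (e.g. 'Singapore,' and 'Zouk.') would still need to be split off.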

    Sometimes, however, an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, thời gian (which means time in English) has to be read as one token, not two. Splitting it into two tokens yields either no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as a single token rather than thời and gian.
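
    To make the Vietnamese case concrete, here is a toy Python sketch: the tiny multi-word lexicon is a made-up stand-in for a real Vietnamese dictionary, and the greedy longest-match join only illustrates the kind of output a proper tokenizer would produce.

        vietnamese = ("Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ "
                      "ở Singapore thì hãy đến Zouk.")

        # Whitespace splitting breaks the LMU "thời gian" (time) into two pieces.
        print(vietnamese.split())

        # Hypothetical mini-lexicon of known multi-word LMUs; a real tokenizer
        # would use a full dictionary and a proper segmentation model.
        # "câu lạc bộ" (club) is a second assumed entry.
        MULTI_WORD_LMUS = {("thời", "gian"), ("câu", "lạc", "bộ")}

        def join_multiword(tokens, lexicon, max_len=3):
            out, i = [], 0
            while i < len(tokens):
                for n in range(max_len, 1, -1):      # try the longest span first
                    if tuple(tokens[i:i + n]) in lexicon:
                        out.append("_".join(tokens[i:i + n]))
                        i += n
                        break
                else:
                    out.append(tokens[i])
                    i += 1
            return out

        print(join_multiword(vietnamese.split(), MULTI_WORD_LMUS))
        # ... 'thời_gian' ... 'câu_lạc_bộ' ... instead of the separate pieces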

    Other languages have orthographies with no spaces at all to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for the computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
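
    A minimal Python sketch of the failure mode for space-free orthographies (jieba and MeCab are only named in the comment as examples of real segmenters/morphological analyzers; they are not used here):

        chinese  = "如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。"
        japanese = "シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。"

        for sentence in (chinese, japanese):
            print(sentence.split())
        # Each print shows a single-element list: with no spaces in the
        # orthography, whitespace splitting returns the whole sentence as one
        # "token". A dedicated segmenter or morphological analyzer (e.g. jieba
        # for Chinese, MeCab for Japanese) is needed to recover the LMUs.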