Tags: python, string, unicode, multilingual, cjk

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?


I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).

Given such a string, I want to separate the English/French/etc. part into words using whitespace as the separator, and to separate the Chinese/Japanese/Korean part into individual characters.

And I want to put all of those separated components into a list.

Some examples would probably make this clear:

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only string:

>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

In this case I can turn the string into a list of Chinese characters. But within the list I'm getting Unicode escape sequences:

[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

How do I get it to display the actual characters instead of the Unicode escapes? Something like:

['我', '爱', '蟒', '蛇']

??

Case 3: A mix of English & Chinese:

I want to turn an input string such as

"我爱Python"

and turn it into a list like this:

['我', '爱', 'Python']

Is it possible to do something like that?


Solution

  • I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worried that a regular expression might not be flexible enough for all of them. You may well not need any of that, though. (In other words: overdesign.)

    # -*- coding: utf-8 -*-
    # Python 2 code: the ur'' prefix and the print statement are not valid in Python 3.
    import re

    def group_words(s):
        regex = []

        # Match a whole word (without re.UNICODE, \w is ASCII-only in Python 2):
        regex += [ur'\w+']

        # Match a single CJK character:
        regex += [ur'[\u4e00-\ufaff]']

        # Match one of anything else, except for spaces:
        regex += [ur'[^\s]']

        regex = "|".join(regex)
        r = re.compile(regex)

        return r.findall(s)

    if __name__ == "__main__":
        print group_words(u"Testing English text")
        print group_words(u"我爱蟒蛇")
        print group_words(u"Testing English text我爱蟒蛇")

    In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
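For readers on Python 3, here is a minimal sketch of the same idea with the regex compiled once at module level. It is a port under assumptions, not the answer's original code: the name `TOKEN_RE` is made up for illustration, and the pattern is reordered because `\w` is Unicode-aware by default in Python 3, so a plain `\w+` would swallow a run of Chinese characters as one "word". The word alternative therefore excludes the same `\u4e00-\ufaff` range the answer uses; that range is still only a rough stand-in for "CJK" and would need adjusting (for example, it does not cover Japanese kana).

```python
import re

# Compiled once at import time instead of on every call.
# Alternatives are tried left to right:
#   [\u4e00-\ufaff]        one character from the range used in the answer
#   [^\W\u4e00-\ufaff]+    a run of word characters outside that range
#   \S                     any other single non-whitespace character
TOKEN_RE = re.compile(r"[\u4e00-\ufaff]|[^\W\u4e00-\ufaff]+|\S")


def group_words(s):
    """Split s into whitespace-separated words and single CJK characters."""
    return TOKEN_RE.findall(s)


if __name__ == "__main__":
    print(group_words("Testing English text"))
    print(group_words("我爱蟒蛇"))
    print(group_words("我爱Python"))
```

In Python 3 the `repr` of a string keeps printable non-ASCII characters as they are, so printing the resulting list shows the characters themselves rather than `\uXXXX` escapes. Punctuation comes back as separate single-character items, as in the original.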