python, nlp, stanford-nlp, segment, chinese-locale

How to not split English into separate letters in the Stanford Chinese Parser


I am using the Stanford Segmenter at http://nlp.stanford.edu/software/segmenter.shtml in Python. Whenever the Chinese segmenter encounters an English word, it splits the word into individual characters, but I want to keep those characters together after the segmentation is done.

For example:

你好abc我好 

currently becomes this after segmentation:

你好 a b c 我 好

but I want it to become

你好 abc 我 好

Is there a way to teach the segmenter to do that? Is there a setting for this?

I Googled this but found no answer, and tried to hack together a way (spending 6 hours on it) to do this by pulling the English characters out of the text and putting them back in after segmentation is done, but realized it is very difficult to do efficiently. Any help on this would be greatly appreciated.


Solution

  • I don't know of a tokenizer built for mixed-language text, so I propose the following hack: go through the text until you find an English word; all text before that word can be tokenized by the Chinese tokenizer; the English word is then appended as another token; repeat. A code sample is below.

    import re

    # `text` is an iterable of sentences; `tokenize` is your Chinese
    # tokenizer (e.g. a call into the Stanford segmenter).
    pat = re.compile("[A-Za-z]+")
    for sentence in text:
        sent_tokens = []
        prev_end = 0
        for match in pat.finditer(sentence):
            print(match.start(0), match.end(0), match.group(0))  # debug output
            # Tokenize the Chinese text before this English word,
            # then keep the English word as a single token.
            chinese_part = sentence[prev_end:match.start(0)]
            sent_tokens += tokenize(chinese_part)
            sent_tokens.append(match.group(0))
            prev_end = match.end(0)
        # Whatever follows the last English match is Chinese again.
        last_chinese_part = sentence[prev_end:]
        sent_tokens += tokenize(last_chinese_part)
        print(sent_tokens)
    

    I think the efficiency should be comparable to running the Chinese tokenizer alone, since the only overhead comes from applying the regex, which is really just a finite-state automaton and runs in O(n).
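
    To make the idea concrete, here is the same loop wrapped in a small helper, with a hypothetical per-character stub standing in for the real Chinese tokenizer (the stub, the `split_mixed` name, and the example call are mine, not part of the Stanford toolkit); swap your actual segmenter call in for the `tokenize` argument:

    import re

    ENGLISH = re.compile("[A-Za-z]+")

    def split_mixed(sentence, tokenize):
        """Keep English runs whole; pass everything else to `tokenize`."""
        tokens = []
        prev_end = 0
        for match in ENGLISH.finditer(sentence):
            tokens += tokenize(sentence[prev_end:match.start()])
            tokens.append(match.group(0))
            prev_end = match.end()
        tokens += tokenize(sentence[prev_end:])
        return tokens

    # Placeholder tokenizer: splits Chinese text into single characters.
    # Replace this with a call to your real segmenter.
    def char_tokenize(s):
        return [c for c in s if not c.isspace()]

    print(split_mixed("你好abc我好", char_tokenize))
    # ['你', '好', 'abc', '我', '好'] -- with the real segmenter, 你好 would
    # stay together, giving ['你好', 'abc', '我', '好']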