Tags: python, stanford-nlp, punctuation, chinese-locale

Stanford Word Segmenter for Chinese in Python how to return results without punctuation


I am trying to segment a Chinese sentence with the Stanford Word Segmenter in Python, but currently the results have punctuation marks in them. I want to return only the words, without the punctuation. What is the best way to do that? I tried Googling for an answer, but didn't find anything.


Solution

  • I think you'd be better off removing the punctuation after the text has been segmented: the Stanford segmenter takes cues from punctuation in doing its job, so you don't want to strip it beforehand. The following works for me on UTF-8 text. For Chinese punctuation, use the Zhon library's `zhon.hanzi.punctuation` character class with a regex:

    import re
    import zhon.hanzi

    # Build a character class from Zhon's Chinese punctuation string.
    h_regex = re.compile('[%s]' % zhon.hanzi.punctuation)

    intxt = '...'  # the segmented text, still containing punctuation
    outtxt = h_regex.sub('', intxt)
    

    And depending on the text you're working with, you may also need to remove non-Chinese punctuation:

    import string
    # Escape string.punctuation, since it contains regex metacharacters.
    p_regex = re.compile('[%s]' % re.escape(string.punctuation))
    outtxt2 = p_regex.sub('', outtxt)
    

    Then you should be golden.
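To see the two passes end to end, here is a minimal self-contained sketch. It assumes the segmenter's usual space-separated output format, and (so the example runs without Zhon installed) substitutes a small hand-picked set of common Chinese punctuation marks for `zhon.hanzi.punctuation`, which covers far more; the sample sentence is hypothetical:

```python
import re
import string

# Hypothetical segmenter output: words separated by spaces,
# with punctuation kept as separate tokens.
segmented = "我 爱 北京 ， 你 呢 ？ ( really ! )"

# Small stand-in for zhon.hanzi.punctuation (assumption: in real
# use you would import that instead for full coverage).
cn_punct = "，。！？；：「」『』（）《》、"

h_regex = re.compile('[%s]' % cn_punct)
p_regex = re.compile('[%s]' % re.escape(string.punctuation))

# Strip Chinese punctuation, then ASCII punctuation.
no_punct = p_regex.sub('', h_regex.sub('', segmented))

# split() collapses the extra whitespace the removed tokens leave behind.
words = no_punct.split()
print(words)
```

Splitting at the end is the easy way to discard the stray spaces left where punctuation tokens used to sit, leaving a clean list of words.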