Tags: python, stanford-nlp, punctuation, chinese-locale

Stanford Word Segmenter for Chinese in Python how to return results without punctuation


I am trying to segment a Chinese sentence with the Stanford Word Segmenter in Python, but currently the results have punctuation marks in them. I want to return only the words, without the punctuation. What is the best way to do that? I tried Googling for an answer, but didn't find anything.


Solution

  • I think you'd be better off removing the punctuation after the text has been segmented: the Stanford segmenter takes cues from punctuation in doing its job, so you don't want to strip it beforehand. The following works for me on UTF-8 text. For Chinese punctuation, use the Zhon library's `zhon.hanzi.punctuation` character class with a regex:

    import re
    import zhon.hanzi

    # Build a character class from Zhon's Chinese punctuation string.
    h_regex = re.compile('[%s]' % zhon.hanzi.punctuation)

    intxt = '...'  # the segmented text, still containing punctuation
    outtxt = h_regex.sub('', intxt)
    

    And depending on the text you're working with, you may also need to remove non-Chinese punctuation:

    import string
    # Escape string.punctuation, since it contains regex metacharacters.
    p_regex = re.compile('[%s]' % re.escape(string.punctuation))
    outtxt2 = p_regex.sub('', outtxt)
    

    Then you should be golden.
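To see the two passes end to end, here is a minimal self-contained sketch. It assumes the segmenter's usual space-separated output format, and (so the example runs without Zhon installed) substitutes a small hand-picked set of common Chinese punctuation marks for `zhon.hanzi.punctuation`, which covers far more; the sample sentence is hypothetical:

```python
import re
import string

# Hypothetical segmenter output: words separated by spaces,
# with punctuation kept as separate tokens.
segmented = "我 爱 北京 ， 你 呢 ？ ( really ! )"

# Small stand-in for zhon.hanzi.punctuation (assumption: in real
# use you would import that instead for full coverage).
cn_punct = "，。！？；：「」『』（）《》、"

h_regex = re.compile('[%s]' % cn_punct)
p_regex = re.compile('[%s]' % re.escape(string.punctuation))

# Strip Chinese punctuation, then ASCII punctuation.
no_punct = p_regex.sub('', h_regex.sub('', segmented))

# split() collapses the extra whitespace the removed tokens leave behind.
words = no_punct.split()
print(words)
```

Splitting at the end is the easy way to discard the stray spaces left where punctuation tokens used to sit, leaving a clean list of words.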