Tags: python, python-3.x, nlp, nltk, stanford-nlp

Unexpected format when running StanfordPOSTagger with NLTK for Chinese


I have installed Python 3.6.0 and NLTK 3.2.4, and downloaded the Stanford POS Tagger 3.8.0.

Then I tried running the following script:

#!/usr/bin/env python3

from nltk.tag import StanfordPOSTagger


st = StanfordPOSTagger('chinese-distsim.tagger')
print(st.tag('这 是 斯坦福 中文 分词器 测试'.split()))

and the output is in an unexpected format:

[('', '这#PN'), ('', '是#VC'), ('', '斯坦福#NR'), ('', '中文#NN'), ('', '分词器#NN'), ('', '测试#NN')]

The tagger does its job, but each word and its part of speech are not separated into a pair; instead they are joined by a '#' into a single string. Is this the expected format for Chinese, or is something wrong?


Solution

  • TL;DR

    Set the tagger's _SEPARATOR to '#', the character the Chinese model uses to join each word to its tag:

    from nltk.tag import StanfordPOSTagger
    
    st = StanfordPOSTagger('chinese-distsim.tagger')
    # The Chinese model outputs word#tag; point NLTK's separator
    # at '#' so tag() can split each pair correctly.
    st._SEPARATOR = '#'
    print(st.tag('这 是 斯坦福 中文 分词器 测试'.split()))
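    
    With the separator set, the same call should return proper (word, tag) pairs, split on the '#' seen in the question's output:
    
    [('这', 'PN'), ('是', 'VC'), ('斯坦福', 'NR'), ('中文', 'NN'), ('分词器', 'NN'), ('测试', 'NN')]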
    

    Better Solution

    Hold out for a while and wait for NLTK v3.2.5, which will have a very simple interface to the Stanford tokenizers, standardized across different languages.

    There'll be no delimiter involved, since the tags and tokens are transferred as JSON over a REST interface =)

    Also, the StanfordSegmenter and StanfordTokenizer classes will be deprecated in v3.2.5.

    First upgrade your nltk version:

    pip install -U nltk
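    
    You can quickly confirm which version you ended up with:
    
    import nltk
    print(nltk.__version__)  # needs to be 3.2.5+ for the CoreNLP interface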
    

    Download and start the Stanford CoreNLP server:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-chinese.properties \
    -preload tokenize,ssplit,pos,lemma,ner,parse \
    -status_port 9001 -port 9001 -timeout 15000
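    
    Once the server is up, you can sanity-check it with a raw HTTP request before going through NLTK; this is the same JSON-over-REST channel mentioned above. A minimal sketch, assuming the `requests` package is installed and the server is listening on port 9001 as configured:
    
    import json
    import requests
    
    # Ask the server to tokenize, sentence-split and POS-tag the raw text.
    # The 'properties' query parameter carries the annotator configuration;
    # the Chinese models come from the -serverProperties file used at launch.
    props = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}
    resp = requests.post(
        'http://localhost:9001/',
        params={'properties': json.dumps(props)},
        data='我家没有电脑。'.encode('utf-8'),
    )
    for sentence in resp.json()['sentences']:
        print([(tok['word'], tok['pos']) for tok in sentence['tokens']])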
    

    Then in NLTK v3.2.5:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> from nltk.tokenize.stanford import CoreNLPTokenizer
    >>> stpos, stner = CoreNLPPOSTagger('http://localhost:9001'), CoreNLPNERTagger('http://localhost:9001')
    >>> sttok = CoreNLPTokenizer('http://localhost:9001')
    
    >>> sttok.tokenize(u'我家没有电脑。')
    ['我家', '没有', '电脑', '。']
    
    # Without segmentation (the input to `raw_string_parse()` is a list of single-character strings)
    >>> stpos.tag(u'我家没有电脑。')
    [('我', 'PN'), ('家', 'NN'), ('没', 'AD'), ('有', 'VV'), ('电', 'NN'), ('脑', 'NN'), ('。', 'PU')]
    # With segmentation
    >>> stpos.tag(sttok.tokenize(u'我家没有电脑。'))
    [('我家', 'NN'), ('没有', 'VE'), ('电脑', 'NN'), ('。', 'PU')]
    
    # Without segmentation (the input to `raw_string_parse()` is a list of single-character strings)
    >>> stner.tag(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。')
    [('奥', 'GPE'), ('巴', 'GPE'), ('马', 'GPE'), ('与', 'O'), ('迈', 'O'), ('克', 'PERSON'), ('尔', 'PERSON'), ('·', 'O'), ('杰', 'O'), ('克', 'O'), ('逊', 'O'), ('一', 'NUMBER'), ('起', 'O'), ('去', 'O'), ('杂', 'O'), ('货', 'O'), ('店', 'O'), ('购', 'O'), ('物', 'O'), ('。', 'O')]
    # With segmentation
    >>> stner.tag(sttok.tokenize(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。'))
    [('奥巴马', 'PERSON'), ('与', 'O'), ('迈克尔·杰克逊', 'PERSON'), ('一起', 'O'), ('去', 'O'), ('杂货店', 'O'), ('购物', 'O'), ('。', 'O')]
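    
    For the sentence from the original question, the new interface should then return clean (word, tag) pairs with no '#' glue. A sketch of the expected shape (the tags mirror what the chinese-distsim model produced above; the CoreNLP model's choices may differ slightly):
    
    >>> stpos.tag('这 是 斯坦福 中文 分词器 测试'.split())
    [('这', 'PN'), ('是', 'VC'), ('斯坦福', 'NR'), ('中文', 'NN'), ('分词器', 'NN'), ('测试', 'NN')]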