nlp spacy named-entity-recognition

POS tagging and NER for Chinese text with spaCy


  • I am trying to print the entities and POS tags present in Chinese text.
  • I installed jieba (!pip3 install jieba) and ran the script below in Google Colab.

But I get an empty tuple for the entities and empty strings for pos_.

from spacy.lang.zh import Chinese

nlp = Chinese()
doc = nlp(u"蘋果公司正考量用一億元買下英國的新創公司")

doc.ents
# returns (), i.e. empty tuple


for word in doc:
    print(word.text, word.pos_)

''' returns
蘋果 
公司 
正 
考量 
用 
一 
億元 
買 
下 
英國 
的 
新創 
公司 
'''

I am new to NLP. What is the correct way to do this?


Solution

  • EDIT 3/21: spaCy now supports NER and POS tagging for Chinese.

    Find the spaCy models here: https://spacy.io/models/zh
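A minimal sketch of what the edit above implies, assuming spaCy v3 or later and that the trained pipeline zh_core_web_sm (one of the packages listed on the models page) has already been downloaded. The helper names load_chinese_pipeline and tags_and_entities are made up for illustration and are not part of spaCy. The key difference from the question's code is that a trained pipeline is loaded with spacy.load instead of the blank Chinese() class, which only tokenizes.

```python
# Assumption: the model was installed beforehand with
#   python -m spacy download zh_core_web_sm


def load_chinese_pipeline(name="zh_core_web_sm"):
    """Return the trained pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy
        return spacy.load(name)
    except (ImportError, OSError):
        # ImportError: spaCy is not installed; OSError: model package not found.
        return None


def tags_and_entities(doc):
    """Collect (text, POS) pairs and (text, label) pairs from a processed doc."""
    pos_tags = [(token.text, token.pos_) for token in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities


nlp = load_chinese_pipeline()
if nlp is None:
    print("spaCy or zh_core_web_sm is not installed; see the comment at the top.")
else:
    doc = nlp("蘋果公司正考量用一億元買下英國的新創公司")
    pos_tags, entities = tags_and_entities(doc)
    for text, pos in pos_tags:
        print(text, pos)
    print(entities)
```

The tokens and labels you get depend on the model version, so no expected output is shown here. The trained pipeline also uses its own word segmenter, so the token boundaries may differ from the jieba-based output in the question.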

    OLD ANSWER:

    spaCy is a fantastic package, but as of yet it does not support POS tagging or NER for Chinese. The blank Chinese() class only tokenizes the text and has no tagger or NER component, so I assume that's the reason you don't get POS results, even though your sentence is

    "Apple is looking at buying U.K. startup for $1 billion"

    in traditional Chinese and should therefore return "Apple" and "U.K." as entities, among others.

    For a more extensive NLP approach to traditional Chinese, you can try the Stanford Chinese NLP package. You are using Python, and Python versions are available (see a demo script or an intro on Medium), but the original is written in Java, if you are more comfortable with that.