Tags: nlp, spacy, named-entity-recognition

spaCy NER doesn't identify lowercase entities


I am having trouble detecting named entities that start with a lowercase letter. I tried the solution suggested in https://github.com/explosion/spaCy/issues/701, but it does not work for me.

===== Info about spaCy=============

spaCy version    2.1.4
Platform         Darwin-16.7.0-x86_64-i386-64bit
Python version   3.6.5
Models           en

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

# Workaround attempted after reading issue #701: copy the lexical
# attributes of the capitalized lexeme onto the lowercase one.
sk = nlp.vocab[u'south korea']
SK = nlp.vocab[u'South Korea']
sk.is_lower = SK.is_lower
sk.shape = SK.shape
sk.shape_ = SK.shape_
sk.is_upper = SK.is_upper
sk.cluster = SK.cluster
sk.is_title = SK.is_title

# Print each token with its POS tag and entity type.
doc = nlp(u'south korea is a country in asia')
for word in doc:
    print(word.text, word.tag_, word.ent_type_)

The expected output is:

south NNP GPE
korea NNP GPE
is VBZ 
a DT 
country NN 
in IN 
asia NNP 

But the actual output of the above code is:

south JJ 
korea NN 
is VBZ 
a DT 
country NN 
in IN 
asia NNP 

Solution

  • The NE recognizer is machine-learned and thus relies on the strongest features it sees in the training data. In English text, capitalization is one of those features, so lowercased names are easily missed.

    You can use a truecaser/recaser, a statistical model that restores casing in lowercased text, and pass its output to spaCy.

    Alternatively, you could retrain the recognizer on training data modified to also contain lowercased entities, but that is a rather tedious process.
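
To make the two suggestions above more concrete, here is a minimal sketch in plain Python. It is not a real truecaser: `recase` is a toy lookup table (the hypothetical `KNOWN_NAMES` gazetteer) standing in for a statistical model, and `add_lowercased_copies` assumes training examples in the `(text, {"entities": [(start, end, label)]})` tuple format used in the spaCy 2.x training examples. All names here are made up for illustration.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> preferred casing.
# A real truecaser would learn casing statistically from cased text.
KNOWN_NAMES = {
    "south korea": "South Korea",
    "asia": "Asia",
}


def recase(text, names=KNOWN_NAMES):
    """Restore the casing of known names (toy stand-in for a truecaser)."""
    if not names:
        return text
    # Longer names first, so multi-word names win over their parts.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: names[match.group(0).lower()], text)


def add_lowercased_copies(train_data):
    """Return train_data plus a lowercased copy of each example."""
    augmented = list(train_data)
    for text, annotations in train_data:
        lowered = text.lower()
        # Character offsets stay valid only if lowercasing keeps the
        # length unchanged, which is not guaranteed for all Unicode text.
        if lowered != text and len(lowered) == len(text):
            copied = {"entities": list(annotations["entities"])}
            augmented.append((lowered, copied))
    return augmented


# Hypothetical single-example training set.
TRAIN_DATA = [
    ("South Korea is a country in Asia", {"entities": [(0, 11, "GPE")]}),
]

recased = recase("south korea is a country in asia")
augmented_data = add_lowercased_copies(TRAIN_DATA)
```

With the first approach, the string in `recased` would be passed to `nlp(...)` in place of the raw lowercase text; whether the model then tags the tokens as entities still depends on the model. With the second, `augmented_data` would replace the original examples when retraining the NER component, so the model also sees lowercase variants with the same entity spans.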