java stanford-nlp

How to recognize a lowercase named entity such as kobe bryant with CoreNLP?


CoreNLP only recognizes a named entity such as Kobe Bryant when it begins with an uppercase letter; it does not recognize kobe bryant as a person. How can I get CoreNLP to recognize named entities that begin with a lowercase letter? Any help is appreciated.


Solution

  • First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one reason why Chinese NER is harder than English NER.) Nevertheless, there are things that you must do to get CoreNLP working fairly well with lowercase text – the default models are trained to work well on well-edited text.

    If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a comma-separated list of models given to the ner.model property).
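As a rough illustration of the mixed setup, the sketch below builds a `java.util.Properties` object whose `ner.model` value lists one cased and one caseless model. The class name `MixedNerConfig` is hypothetical, the cased model path is an assumption about the layout of the English models jar, and the order of the two models is a judgment call you may want to experiment with.

```java
import java.util.Properties;

// Hypothetical helper: builds a configuration that lists a cased and a
// caseless NER model together in the ner.model property.
final class MixedNerConfig {
    // Assumed path of the cased 3-class model inside the English models jar.
    static final String CASED =
        "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
    // Caseless 3-class model, as used in the caseless command in this answer.
    static final String CASELESS =
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz";

    static Properties mixedNerProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // ner.model takes a comma-separated list of model paths.
        props.setProperty("ner.model", String.join(",", CASED, CASELESS));
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting property so it can be inspected.
        System.out.println(mixedNerProps().getProperty("ner.model"));
    }
}
```

With CoreNLP and the models jar on the classpath, the returned object can be passed to `new StanfordCoreNLP(props)`.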

    Approach 1: Caseless models. We also provide English models that ignore case information. They will work much better on all lowercase text.

    Approach 2: Use the truecaser. We provide a truecase annotator, which attempts to convert text into formally edited capitalization. You can apply it first, and then use the regular annotators.

    In general, it's not clear to us that one of these approaches usually or always wins. You can try both.

    Important: The extra components invoked below come from the English models jar, so you need to download it and have it on your classpath.

    Here's an example. We start with a sample text:

    % cat lakers.txt
    lonzo ball talked about kobe bryant after the lakers game.
    

    With the default models, no entities are found and all the name words just get a common noun tag. Sad!

    % java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
    % cat lakers.txt.conll 
    1   lonzo   lonzo   NN  O   _   _
    2   ball    ball    NN  O   _   _
    3   talked  talk    VBD O   _   _
    4   about   about   IN  O   _   _
    5   kobe    kobe    NN  O   _   _
    6   bryant  bryant  NN  O   _   _
    7   after   after   IN  O   _   _
    8   the the DT  O   _   _
    9   lakers  laker   NNS O   _   _
    10  game    game    NN  O   _   _
    11  .   .   .   O   _   _
    

    Below, we ask to use the caseless models, and then we're doing pretty well: All the name words are now recognized as proper nouns, and the two person names are recognized. But the team name is still missed.

    % java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
    % cat lakers.txt.conll 
    1   lonzo   lonzo   NNP PERSON  _   _
    2   ball    ball    NNP PERSON  _   _
    3   talked  talk    VBD O   _   _
    4   about   about   IN  O   _   _
    5   kobe    kobe    NNP PERSON  _   _
    6   bryant  bryant  NNP PERSON  _   _
    7   after   after   IN  O   _   _
    8   the the DT  O   _   _
    9   lakers  lakers  NNPS    O   _   _
    10  game    game    NN  O   _   _
    11  .   .   .   O   _   _
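If you call CoreNLP from Java rather than the command line, the same caseless settings can be expressed as properties. The following is a minimal sketch under that assumption; the class name `CaselessConfig` is hypothetical, and the model paths are copied from the command above.

```java
import java.util.Properties;

// Hypothetical helper: mirrors the caseless command-line options as Properties.
final class CaselessConfig {
    static final String MODELS = "edu/stanford/nlp/models/";

    static Properties caselessProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Caseless part-of-speech tagger.
        props.setProperty("pos.model",
            MODELS + "pos-tagger/english-caseless-left3words-distsim.tagger");
        // Three caseless NER models, given as one comma-separated list.
        props.setProperty("ner.model", String.join(",",
            MODELS + "ner/english.all.3class.caseless.distsim.crf.ser.gz",
            MODELS + "ner/english.muc.7class.caseless.distsim.crf.ser.gz",
            MODELS + "ner/english.conll.4class.caseless.distsim.crf.ser.gz"));
        return props;
    }

    public static void main(String[] args) {
        // Print each configured property for inspection.
        Properties props = caselessProps();
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```

With CoreNLP and the models jar on the classpath, pass the result to `new StanfordCoreNLP(props)` and annotate the text as usual.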
    

    Instead, you can run truecasing prior to POS tagging and NER:

    % java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
    % cat lakers.txt.conll 
    1   Lonzo   Lonzo   NNP PERSON  _   _
    2   ball    ball    NN  O   _   _
    3   talked  talk    VBD O   _   _
    4   about   about   IN  O   _   _
    5   Kobe    Kobe    NNP PERSON  _   _
    6   Bryant  Bryant  NNP PERSON  _   _
    7   after   after   IN  O   _   _
    8   the the DT  O   _   _
    9   Lakers  Lakers  NNPS    ORGANIZATION    _   _
    10  game    game    NN  O   _   _
    11  .   .   .   O   _   _
    

    Now, the organization Lakers is recognized, and in general nearly all the entity words are tagged as proper nouns with the correct entity label, but the truecaser fails to capitalize ball, which remains a common noun. Of course, this is a fairly hard word to get right in caseless text, since ball is quite a frequent common noun.
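The truecaser run can likewise be configured from Java. This is a sketch only: the class name `TruecaseConfig` is hypothetical, and it assumes that the bare `-truecase.overwriteText` flag on the command line corresponds to setting that property to `true`.

```java
import java.util.Properties;

// Hypothetical helper: mirrors the truecase command-line options as Properties.
final class TruecaseConfig {
    static Properties truecaseProps() {
        Properties props = new Properties();
        // truecase runs after sentence splitting and before POS tagging and NER.
        props.setProperty("annotators", "tokenize,ssplit,truecase,pos,lemma,ner");
        // Assumed equivalent of the bare -truecase.overwriteText flag:
        // replace each token's text with its truecased form.
        props.setProperty("truecase.overwriteText", "true");
        return props;
    }

    public static void main(String[] args) {
        // Print the annotator list so the ordering can be checked.
        System.out.println(truecaseProps().getProperty("annotators"));
    }
}
```

With CoreNLP and the models jar on the classpath, pass the result to `new StanfordCoreNLP(props)`; the default cased POS and NER models then run on the truecased tokens.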