I have a problem: CoreNLP only recognizes named entities such as Kobe Bryant that begin with an uppercase character, and can't recognize kobe bryant as a person. How can I get CoreNLP to recognize a named entity that begins with a lowercase character? Thanks!
First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one reason why Chinese NER is harder than English NER.) Nevertheless, there are things that you can do to get CoreNLP working fairly well with lowercase text. Out of the box it struggles because the default models are trained to work well on well-edited text.
If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a long list of models given to the ner.model property).
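For the mixed-case option, here is a minimal sketch of a properties file that hands both model sets to ner.model. The file name mixed-case.properties is hypothetical. The cased paths are the default English NER models, and the caseless paths are the ones used in the example command later in this answer. Listing the cased models first is an assumption, so try the other order too if results look off.

```shell
# Sketch only: mixed-case.properties is a made-up file name.
ner_dir=edu/stanford/nlp/models/ner

# Default (cased) English NER models.
cased="$ner_dir/english.all.3class.distsim.crf.ser.gz,$ner_dir/english.muc.7class.distsim.crf.ser.gz,$ner_dir/english.conll.4class.distsim.crf.ser.gz"

# Caseless English NER models.
caseless="$ner_dir/english.all.3class.caseless.distsim.crf.ser.gz,$ner_dir/english.muc.7class.caseless.distsim.crf.ser.gz,$ner_dir/english.conll.4class.caseless.distsim.crf.ser.gz"

# ner.model takes one comma-separated list; the order here is an assumption.
cat > mixed-case.properties <<EOF
annotators = tokenize,ssplit,pos,lemma,ner
ner.model = $cased,$caseless
EOF
```

You could then pass the file to the pipeline with -props mixed-case.properties in place of the individual -annotators and -ner.model flags.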
Approach 1: Caseless models. We also provide English models that ignore case information. They will work much better on all lowercase text.
Approach 2: Use the truecaser. We provide a truecase annotator, which attempts to convert text into formally edited capitalization. You can apply it first, and then use the regular annotators.
In general, it's not clear to us that one of these approaches usually or always wins. You can try both.
Important: To use the extra components invoked below, you need to have downloaded the English models jar and have it on your classpath.
Here's an example. We start with a sample text:
% cat lakers.txt
lonzo ball talked about kobe bryant after the lakers game.
With the default models, no entities are found, and all the name words just get a common noun tag. Sad!
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
% cat lakers.txt.conll
1 lonzo lonzo NN O _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NN O _ _
6 bryant bryant NN O _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers laker NNS O _ _
10 game game NN O _ _
11 . . . O _ _
Below, we ask to use the caseless models, and then we're doing pretty well: All the name words are now recognized as proper nouns, and the two person names are recognized. But the team name is still missed.
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
% cat lakers.txt.conll
1 lonzo lonzo NNP PERSON _ _
2 ball ball NNP PERSON _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NNP PERSON _ _
6 bryant bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers lakers NNPS O _ _
10 game game NN O _ _
11 . . . O _ _
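If you want entity spans rather than per-token tags, the CoNLL output can be post-processed. The following is a rough sketch, not part of CoreNLP: the entity_spans function and the caseless.conll file name are made up for illustration, and it assumes the column layout shown above (word in column 2, NER label in column 5). It simply merges adjacent tokens that share a label, so two back-to-back entities of the same type would be glued together.

```shell
# Recreate the caseless-model output shown above under a hypothetical name.
cat > caseless.conll <<'EOF'
1 lonzo lonzo NNP PERSON _ _
2 ball ball NNP PERSON _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NNP PERSON _ _
6 bryant bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers lakers NNPS O _ _
10 game game NN O _ _
11 . . . O _ _
EOF

# entity_spans FILE: print "LABEL<TAB>words" for each run of same-label tokens.
entity_spans() {
  awk '
    # Print the span collected so far (if any) and reset the state.
    function emit() { if (label != "") print label "\t" span; label = ""; span = "" }
    NF < 5      { emit(); next }               # blank line: sentence boundary
    $5 == "O"   { emit(); next }               # outside any entity
    $5 == label { span = span " " $2; next }   # same label: extend the span
                { emit(); label = $5; span = $2 }  # new label: start a span
    END         { emit() }
  ' "$1"
}

entity_spans caseless.conll
```

On the output above this prints one line for lonzo ball and one for kobe bryant, both labeled PERSON.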
Alternatively, instead of using the caseless models, you can run truecasing prior to POS tagging and NER:
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
% cat lakers.txt.conll
1 Lonzo Lonzo NNP PERSON _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 Kobe Kobe NNP PERSON _ _
6 Bryant Bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 Lakers Lakers NNPS ORGANIZATION _ _
10 game game NN O _ _
11 . . . O _ _
Now, the organization Lakers is recognized, and in general nearly all the entity words are tagged as proper nouns with the correct entity label, but it fails to get ball, which remains a common noun. Of course, this is a fairly hard word to get right in caseless text, since ball is quite a frequent common noun.
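Since neither approach clearly wins, it can help to look at where the two runs disagree. Note that both commands write to lakers.txt.conll, so you would need to copy each result to its own file before the next run. Below is a minimal sketch under that assumption: the names caseless.conll, truecase.conll, and ner_diff are hypothetical, and the comparison assumes both files have the same tokens on the same lines.

```shell
# Saved copy of the caseless-model run (hypothetical file name).
cat > caseless.conll <<'EOF'
1 lonzo lonzo NNP PERSON _ _
2 ball ball NNP PERSON _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NNP PERSON _ _
6 bryant bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers lakers NNPS O _ _
10 game game NN O _ _
11 . . . O _ _
EOF

# Saved copy of the truecaser run (hypothetical file name).
cat > truecase.conll <<'EOF'
1 Lonzo Lonzo NNP PERSON _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 Kobe Kobe NNP PERSON _ _
6 Bryant Bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 Lakers Lakers NNPS ORGANIZATION _ _
10 game game NN O _ _
11 . . . O _ _
EOF

# ner_diff A B: list tokens whose NER label (column 5) differs between runs.
# After paste, the columns of B follow the 7 columns of A, so the NER label
# of B is field 12. Prints: index, word from A, label in A, label in B.
ner_diff() {
  paste "$1" "$2" | awk '$5 != $12 { print $1, $2, $5, $12 }'
}

ner_diff caseless.conll truecase.conll
```

For the two outputs shown in this answer, the disagreements are exactly the two tokens discussed above: ball (PERSON with the caseless models, O with the truecaser) and lakers (O with the caseless models, ORGANIZATION with the truecaser).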