stanford-nlp

Stanford NLP named entities of more than one token


I'm experimenting with Stanford Core NLP for named entity recognition.

Some named entities consist of more than one token, for example, Person: "Bill Smith". I can't figure out what API calls to use to determine when "Bill" and "Smith" should be considered a single entity, and when they should be two different entities.

Is there some decent documentation somewhere which explains this?

Here's my current code:

    InputStream is = getClass().getResourceAsStream(MODEL_NAME);
    if (MODEL_NAME.endsWith(".gz")) {
        is = new GZIPInputStream(is);
    }
    is = new BufferedInputStream(is);

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(is);
    is.close();

    String text = "Hello, Bill Smith, how are you?";

    List<List<CoreLabel>> sentences = classifier.classify(text);
    for (List<CoreLabel> sentence: sentences) {
        for (CoreLabel word: sentence) {
            String type = word.get(CoreAnnotations.AnswerAnnotation.class);
            System.out.println(word + " is of type " + type);
        }
    }

Also, it isn't clear to me why the "PERSON" annotation is coming back as AnswerAnnotation, instead of CoreAnnotations.EntityClassAnnotation, EntityTypeAnnotation, or something else.


Solution

  • You should use the "entitymentions" annotator, which marks each contiguous sequence of tokens with the same NER tag as a single entity. The list of entities for each sentence is stored under the CoreAnnotations.MentionsAnnotation.class key, and each entity mention is itself a CoreMap.

    Looking over this code could help:

    https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/EntityMentionsAnnotator.java

    Some sample code:

    import java.io.*;
    import java.util.*;
    import edu.stanford.nlp.ling.*;
    import edu.stanford.nlp.pipeline.*;
    import edu.stanford.nlp.util.*;

    public class EntityMentionsExample {
    
      public static void main (String[] args) throws IOException {
        // "entitymentions" must come after "ner", because it groups the tags that "ner" produces.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "Joe Smith is from Florida.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        System.out.println("---");
        System.out.println("text: " + text);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
          // Each mention is a CoreMap spanning one or more tokens, e.g. "Joe Smith".
          for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
            System.out.print(entityMention.get(CoreAnnotations.TextAnnotation.class));
            System.out.print("\t");
            System.out.print(
                    entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
            System.out.println();
          }
        }
      }
    }
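
If you are working with the per-token labels directly, as in the question's classifier loop, the grouping idea can also be sketched by hand. The following is only a simplified illustration, not the annotator's actual implementation: the class `MentionGrouper`, its `group` method, and the hand-written tag list are hypothetical, and it assumes that "O" is the tag for tokens outside any entity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): merges adjacent tokens that share an NER tag.
final class MentionGrouper {

    /** One entity: the joined token text plus the tag shared by its tokens. */
    static final class Mention {
        final String text;
        final String tag;

        Mention(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return text + "\t" + tag;
        }
    }

    /** Merges each run of adjacent tokens with the same non-"O" tag into one Mention. */
    static List<Mention> group(List<String> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<Mention> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O"; // assumption: "O" marks tokens outside any entity
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // Tag changed: close the open run (if any) before starting a new one.
                if (!currentTag.equals("O")) {
                    mentions.add(new Mention(current.toString(), currentTag));
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens.get(i));
            }
        }
        // Close a run that extends to the end of the sentence.
        if (!currentTag.equals("O")) {
            mentions.add(new Mention(current.toString(), currentTag));
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Tokens and tags are written by hand here; in practice they would come from the classifier.
        List<String> tokens = Arrays.asList("Hello", ",", "Bill", "Smith", ",", "how", "are", "you", "?");
        List<String> tags = Arrays.asList("O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O");
        for (Mention m : group(tokens, tags)) {
            System.out.println(m);
        }
    }
}
```

Note that with plain per-token tags, two distinct entities of the same type that sit directly next to each other would be merged into one mention by this sketch, which is one reason to prefer the built-in annotator where possible.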