
Get begin positions and/or NER from words after parsing


I am using the new Stanford CoreNLP NN parser. Here's a simplified version of the code:

// Sentence to be parsed
String sentence = "This is an example sentence.";

// This is where we store the result from the parser. Initially set to "null".
GrammaticalStructure gs = null;

// Parse the sentence
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(sentence));
List<TaggedWord> tagged = null;
for (List<HasWord> sent : tokenizer) {
    tagged = tagger.tagSentence(sent);
    gs = parser.predict(tagged);
}

// Convert the GrammaticalStructure object (the parsing result) into a semantic graph
SemanticGraph semanticGraph = SemanticGraphFactory.generateUncollapsedDependencies(gs);

Now, when I iterate over the vertices of semanticGraph, I can get the POS tag, but I can't get the word's NER tag or its begin position. So, when I do this:

for (IndexedWord vertex : new ArrayList<>(semanticGraph.vertexSet())){
    String tag = vertex.tag();
    String ner = vertex.ner();
    int beginPosition = vertex.beginPosition();
}

for tag I get the POS tag correctly, but for ner I get null and for beginPosition I always get -1.

How can I parse the sentence while preserving each word's begin position in the original string? And, if possible, how do I get the NER? (beginPosition is actually more important in my case.)


Solution

  • In your case the NER tags don't exist because your code never runs an NER annotator. I am not sure why beginPosition is not set in the SemanticGraph.

    Using a StanfordCoreNLP pipeline is highly recommended for multiple annotations that depend on each other. It's very easy to (re)configure it to use different annotators through a Properties object. There is also potential for better performance as it can use multiple threads.

    Here is a simple example with a pipeline that keeps the for loop from your code. I have tested it (CoreNLP 3.5.2), and both ner and beginPosition are set correctly. Since your example sentence contains no recognizable entities, ner is always "O". Also, if your document has more than one sentence, you will have to iterate over the sentences list.

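    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.IndexedWord;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
    import edu.stanford.nlp.util.CoreMap;

    // The annotators run in this order; ner is what populates vertex.ner()
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    String sentence = "This is an example sentence.";
    Annotation document = new Annotation(sentence);
    pipeline.annotate(document);

    // Only the first sentence is used here; loop over sentences for longer documents
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    CoreMap map = sentences.get(0);
    SemanticGraph semanticGraph = map.get(CollapsedCCProcessedDependenciesAnnotation.class);

    for (IndexedWord vertex : new ArrayList<>(semanticGraph.vertexSet())) {
        String tag = vertex.tag();                  // POS tag
        String ner = vertex.ner();                  // NER label, "O" when not an entity
        int beginPosition = vertex.beginPosition(); // character offset in the original string
    }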
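
To illustrate the point about reconfiguring the pipeline through a Properties object, here is a minimal sketch that uses only java.util. The class PipelineConfig and the helper withAnnotators are hypothetical names made up for this example; they are not part of CoreNLP. The second configuration swaps "parse" for "depparse", which, as far as I know, is the annotator name for the neural network dependency parser mentioned in the question. Whether it fills in the same dependency annotation used above should be checked against your CoreNLP version.

```java
import java.util.Properties;

class PipelineConfig {

    // Hypothetical helper (not part of CoreNLP): builds the Properties object
    // from which the StanfordCoreNLP constructor reads its annotator list.
    static Properties withAnnotators(String... annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(", ", annotators));
        return props;
    }

    public static void main(String[] args) {
        // Same annotators as the example above.
        Properties withParse =
                withAnnotators("tokenize", "ssplit", "pos", "lemma", "ner", "parse");

        // Assumed variant: "depparse" in place of "parse".
        Properties withDepparse =
                withAnnotators("tokenize", "ssplit", "pos", "lemma", "ner", "depparse");

        System.out.println(withParse.getProperty("annotators"));
        System.out.println(withDepparse.getProperty("annotators"));

        // Either object would then be handed to the pipeline constructor,
        // e.g. new StanfordCoreNLP(withDepparse), as in the example above.
    }
}
```

Switching configurations this way changes only the annotators string; the code that reads tag, ner and beginPosition from the graph vertices stays the same.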