I am using the new Stanford CoreNLP NN parser. Here's a simplified version of the code:
// Sentence to be parsed
String sentence = "This is an example sentence.";
// This is where we store the result from the parser. Initially set to "null".
GrammaticalStructure gs = null;
// Parse the sentence
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(sentence));
List<TaggedWord> tagged = null;
for (List<HasWord> sent : tokenizer) {
    tagged = tagger.tagSentence(sent);
    gs = parser.predict(tagged);
}
// Convert the GrammaticalStructure object (the parsing result) into a semantic graph
SemanticGraph semanticGraph = SemanticGraphFactory.generateUncollapsedDependencies(gs);
Now, when I iterate over the vertices of semanticGraph, I can get the POS tag, but I can't get the NER tag of the word or its begin position. So, when I do this:
for (IndexedWord vertex : new ArrayList<>(semanticGraph.vertexSet())) {
    String tag = vertex.tag();
    String ner = vertex.ner();
    int beginPosition = vertex.beginPosition();
}
for tag I get the POS tag correctly, for ner I get null, and for beginPosition I always get -1.
How can I parse the sentence while correctly preserving the begin position of each word in the original string? And, if possible, how do I get the NER tag? (beginPosition is actually more important in my case.)
In your case, NER tags don't exist because your code never actually performs an NER annotation. I am not sure why beginPosition is not set in the SemanticGraph.
Using a StanfordCoreNLP pipeline is highly recommended for multiple annotations that depend on each other. It's very easy to (re)configure it to use different annotators through a Properties object. There is also potential for better performance, as it can use multiple threads.
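As a rough illustration of that reconfiguration, the sketch below builds two Properties objects with different annotator lists, using only java.util.Properties. The class AnnotatorConfig and the helper buildProps are hypothetical names introduced for this example; only the "annotators" key and the annotator names come from this answer.

```java
import java.util.Properties;

class AnnotatorConfig {
    // Hypothetical helper: joins annotator names into the comma-separated
    // value stored under the "annotators" property.
    static Properties buildProps(String... annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(", ", annotators));
        return props;
    }

    public static void main(String[] args) {
        // A lighter configuration: tokenization, sentence splitting, POS tags.
        Properties posOnly = buildProps("tokenize", "ssplit", "pos");
        // The fuller configuration used in this answer, including NER and parsing.
        Properties withNer = buildProps("tokenize", "ssplit", "pos", "lemma", "ner", "parse");

        System.out.println(posOnly.getProperty("annotators"));
        System.out.println(withNer.getProperty("annotators"));
    }
}
```

Either object would then be handed to the StanfordCoreNLP constructor, so switching annotator sets should not require touching the rest of the code.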
Here is a simple example with a pipeline that keeps the for loop from your code. I have tested it (CoreNLP 3.5.2), and both ner and beginPosition are set correctly. Since no recognizable entities exist in your example sentence, ner is always "O". Also, if you have more than one sentence in your document, you will have to iterate over the sentences list.
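import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

// Configure the annotators; "ner" is what produces the NER tags
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String sentence = "This is an example sentence.";
Annotation document = new Annotation(sentence);
pipeline.annotate(document);
// Only the first sentence of the document is used here
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
CoreMap map = sentences.get(0);
SemanticGraph semanticGraph = map.get(CollapsedCCProcessedDependenciesAnnotation.class);
for (IndexedWord vertex : new ArrayList<>(semanticGraph.vertexSet())) {
    String tag = vertex.tag();
    String ner = vertex.ner();
    int beginPosition = vertex.beginPosition();
}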
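To make concrete what beginPosition is expected to hold (a character offset into the original string), here is a standalone sketch that does not use CoreNLP at all. The regex tokenizer, the Token class, and the class name OffsetSketch are hypothetical stand-ins for illustration only; CoreNLP's own tokenizer may split text differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetSketch {
    // Hypothetical stand-in for a token that carries character offsets.
    static final class Token {
        final String word;
        final int begin; // inclusive offset into the original string
        final int end;   // exclusive offset into the original string

        Token(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    // Matches either a run of word characters or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits the text into tokens, recording where each one starts and ends.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "This is an example sentence.";
        for (Token t : tokenize(sentence)) {
            // Slicing the original string with the offsets recovers the token text.
            String slice = sentence.substring(t.begin, t.end);
            System.out.println(t.word + " [" + t.begin + ", " + t.end + ") -> " + slice);
        }
    }
}
```

The same kind of sanity check can be applied to the pipeline output: assuming the offsets are populated, slicing the original sentence from vertex.beginPosition() to vertex.endPosition() should give back the text of that word.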