java nlp stanford-nlp

Stanford Core NLP - understanding coreference resolution


I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.

Thank you


Solution

  • The first number is a cluster ID (a cluster groups the mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():

    public String toString(){
        return position.toString();
    }
    

    where position is the set of position pairs of the entity's mentions (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete example (in Groovy) that shows how to get from positions to tokens:

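    // Package names as in the dcoref-era CoreNLP releases; they may differ in other versions.
    import edu.stanford.nlp.dcoref.CorefChain
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    class Example {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.put("dcoref.score", true);
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
            Annotation document = new Annotation(aText);
    
            pipeline.annotate(document);
            // Maps each cluster ID to its coreference chain.
            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    
            println aText
    
            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
              CorefChain c = entry.getValue();
              println "ClusterId: " + entry.getKey();
              CorefMention cm = c.getRepresentativeMention();
              // Note: startIndex/endIndex are token indices (1-based, end exclusive), not
              // character offsets, so subSequence() on the raw string yields odd fragments.
              println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
    
              List<CorefMention> cms = c.getCorefMentions();
              println "Mentions:  ";
              cms.each { it ->
                  print aText.subSequence(it.startIndex, it.endIndex) + "|";
              }
            }
        }
    }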
    

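    Output (the odd fragments such as 's' appear because startIndex and endIndex are token indices rather than character offsets, so subSequence cuts the raw string in the wrong places):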

    The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
    ClusterId: 1
    Representative Mention: he
    Mentions: he|atom |s|
    ClusterId: 6
    Representative Mention:  basic unit 
    Mentions:  basic unit |
    ClusterId: 8
    Representative Mention:  unit 
    Mentions:  unit |
    ClusterId: 10
    Representative Mention: it 
    Mentions: it |
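
    The fragments in the output above are consistent with startIndex and endIndex being 1-based token positions (end exclusive) that were applied as character offsets. Below is a minimal, CoreNLP-free sketch of the difference. The regex tokenizer, the class name TokenSpanDemo and the helper mentionText are hypothetical stand-ins (CoreNLP's own tokenizer may split differently), and the three index pairs are the ones implied by the "he|atom |s|" line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSpanDemo {
    // Stand-in tokenizer (an assumption, not CoreNLP's): each word and each
    // punctuation mark becomes its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Treats startIndex as a 1-based token position and endIndex as one past
    // the last token of the mention.
    static String mentionText(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists of a dense "
                + "central nucleus surrounded by a cloud of negatively charged electrons.";
        List<String> tokens = tokenize(text);

        // Index pairs implied by the cluster 1 output above.
        int[][] spans = {{1, 3}, {4, 9}, {10, 11}};
        for (int[] s : spans) {
            // Used as character offsets: "he", "atom ", "s"
            System.out.println("as chars:  '" + text.subSequence(s[0], s[1]) + "'");
            // Used as token positions: "The atom", "a basic unit of matter", "it"
            System.out.println("as tokens: '" + mentionText(tokens, s[0], s[1]) + "'");
        }
    }
}
```

    In the real pipeline the token list would come from the annotated sentence rather than from a regex, but the index arithmetic (subtract one, end exclusive) stays the same under this reading.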