Tags: python, nlp, stanford-nlp, linguistics, pycorenlp

Anaphora resolution in stanford-nlp using python


I am trying to do anaphora resolution, and my code so far is below.

First I navigate to the folder where I downloaded the Stanford CoreNLP module. Then I run the following command in the command prompt to start the CoreNLP server:

java -mx4g -cp "*;stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

After that I execute the code below in Python:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

I want to change the sentence "Tom is a smart boy. He know a lot of thing." into "Tom is a smart boy. Tom know a lot of thing.", but I could not find a tutorial or any other help for doing this in Python.

All I am able to do is annotate the text for coreference resolution with the code below:

output = nlp.annotate(sentence, properties={'annotators':'dcoref','outputFormat':'json','ner.useSUTime':'false'})

and then, by extracting the coreference chains from the result:

coreferences = output['corefs']

I get the JSON below as the value of coreferences:

{u'1': [{u'animacy': u'ANIMATE',
   u'endIndex': 2,
   u'gender': u'MALE',
   u'headIndex': 1,
   u'id': 1,
   u'isRepresentativeMention': True,
   u'number': u'SINGULAR',
   u'position': [1, 1],
   u'sentNum': 1,
   u'startIndex': 1,
   u'text': u'Tom',
   u'type': u'PROPER'},
  {u'animacy': u'ANIMATE',
   u'endIndex': 6,
   u'gender': u'MALE',
   u'headIndex': 5,
   u'id': 2,
   u'isRepresentativeMention': False,
   u'number': u'SINGULAR',
   u'position': [1, 2],
   u'sentNum': 1,
   u'startIndex': 3,
   u'text': u'a smart boy',
   u'type': u'NOMINAL'},
  {u'animacy': u'ANIMATE',
   u'endIndex': 2,
   u'gender': u'MALE',
   u'headIndex': 1,
   u'id': 3,
   u'isRepresentativeMention': False,
   u'number': u'SINGULAR',
   u'position': [2, 1],
   u'sentNum': 2,
   u'startIndex': 1,
   u'text': u'He',
   u'type': u'PRONOMINAL'}],
 u'4': [{u'animacy': u'INANIMATE',
   u'endIndex': 7,
   u'gender': u'NEUTRAL',
   u'headIndex': 4,
   u'id': 4,
   u'isRepresentativeMention': True,
   u'number': u'SINGULAR',
   u'position': [2, 2],
   u'sentNum': 2,
   u'startIndex': 3,
   u'text': u'a lot of thing',
   u'type': u'NOMINAL'}]}
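
The indices in this structure are 1-based: sentNum counts sentences from 1, and startIndex/endIndex are token positions within the sentence, with endIndex exclusive ('a smart boy' spans tokens 3 to 5 and has endIndex 6). A minimal sketch of reading the chains, using a trimmed, hand-copied version of the output above; the helper name pronoun_targets is hypothetical, and falling back to the first mention when no mention is flagged as representative is an assumption:

```python
# Trimmed, hand-copied version of the 'corefs' value shown above
# (only the fields used here are kept).
coreferences = {
    '1': [
        {'text': 'Tom', 'type': 'PROPER', 'sentNum': 1, 'startIndex': 1,
         'endIndex': 2, 'isRepresentativeMention': True},
        {'text': 'a smart boy', 'type': 'NOMINAL', 'sentNum': 1, 'startIndex': 3,
         'endIndex': 6, 'isRepresentativeMention': False},
        {'text': 'He', 'type': 'PRONOMINAL', 'sentNum': 2, 'startIndex': 1,
         'endIndex': 2, 'isRepresentativeMention': False},
    ],
    '4': [
        {'text': 'a lot of thing', 'type': 'NOMINAL', 'sentNum': 2, 'startIndex': 3,
         'endIndex': 7, 'isRepresentativeMention': True},
    ],
}


def pronoun_targets(corefs):
    """Return (sentence index, token index, replacement text) for each pronoun.

    Both indices are converted to 0-based positions for use with Python lists.
    """
    targets = []
    for mentions in corefs.values():
        # Pick the mention flagged as representative; fall back to the first
        # mention of the chain if none is flagged (an assumption).
        representative = next(
            (m for m in mentions if m['isRepresentativeMention']), mentions[0])
        for mention in mentions:
            if mention['type'] == 'PRONOMINAL':
                targets.append((mention['sentNum'] - 1,
                                mention['startIndex'] - 1,
                                representative['text']))
    return targets


print(pronoun_targets(coreferences))  # [(1, 0, 'Tom')]
```

The single result says that token 0 of sentence 1 (0-based), i.e. "He", should be replaced by "Tom".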

Any help on this?


Solution

  • Here is one possible solution that uses the data structure output by CoreNLP, which already contains all the information needed. It is not intended as a complete solution, and extensions will probably be required to handle every situation, but it is a good starting point.

    from pycorenlp import StanfordCoreNLP
    
    nlp = StanfordCoreNLP('http://localhost:9000')
    
    
    def resolve(corenlp_output):
        """ Transfer the word form of the antecedent to its associated pronominal anaphor(s) """
        for mentions in corenlp_output['corefs'].values():
            antecedent = mentions[0]  # the antecedent is the first mention in the coreference chain
            for mention in mentions[1:]:
                if mention['type'] == 'PRONOMINAL':
                    # sentNum and startIndex are 1-based, so subtract 1 to index the Python lists
                    target_sentence = mention['sentNum'] - 1
                    target_token = mention['startIndex'] - 1
                    # transfer the antecedent's word form to the appropriate token in the sentence
                    tokens = corenlp_output['sentences'][target_sentence]['tokens']
                    tokens[target_token]['word'] = antecedent['text']
    
    
    def print_resolved(corenlp_output):
        """ Print the "resolved" output """
        possessives = ['hers', 'his', 'their', 'theirs']
        for sentence in corenlp_output['sentences']:
            for token in sentence['tokens']:
                output_word = token['word']
                # check lemmas as well as tags for possessive pronouns in case of tagging errors
                if token['lemma'] in possessives or token['pos'] == 'PRP$':
                    output_word += "'s"  # add the possessive morpheme
                output_word += token['after']
                print(output_word, end='')
    
    
    text = "Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but " \
           "hers is blue. It is older than hers. The big cat ate its dinner."
    
    output = nlp.annotate(text, properties= {'annotators':'dcoref','outputFormat':'json','ner.useSUTime':'false'})
    
    resolve(output)
    
    print('Original:', text)
    print('Resolved: ', end='')
    print_resolved(output)
    

    This gives the following output:

    Original: Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but hers is blue. It is older than hers. The big cat ate its dinner.
    Resolved: Tom and Jane are good friends. Tom and Jane are cool. Tom knows a lot of things and so does Jane. Tom's car is red, but Jane's is blue. His car is older than Jane's. The big cat ate The big cat's dinner.
    

    As you can see, this solution does not adjust capitalization when a pronoun is replaced by a sentence-initial (title-case) antecedent: the last sentence contains "The big cat" where "the big cat" is expected. The right fix depends on the category of the antecedent: common-noun antecedents need lowercasing, while proper-noun antecedents do not. Other ad hoc processing may also be necessary (as for the possessives in my test sentence).

    The code also presupposes that you will not want to reuse the original output tokens, since it modifies them in place. A way around this would be to make a copy of the original data structure, or to store the resolved form in a new attribute and change the print_resolved function accordingly. Correcting resolution errors is a further challenge!
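
The copy-based workaround and the lowercasing of common-noun antecedents described above could be sketched as follows. This is only a sketch under assumptions: resolve_copy is a hypothetical name, the sample structure is hand-built to mimic just the fields the code reads (it is not real CoreNLP output), and lowercasing is applied only to NOMINAL antecedents that begin their sentence, which would wrongly lowercase a nominal mention that starts with a proper name.

```python
import copy


def resolve_copy(corenlp_output):
    """Return a resolved deep copy; the input structure is left unchanged."""
    resolved = copy.deepcopy(corenlp_output)
    for mentions in resolved['corefs'].values():
        antecedent = mentions[0]  # assumption: the first mention is the antecedent
        for mention in mentions[1:]:
            if mention['type'] != 'PRONOMINAL':
                continue
            replacement = antecedent['text']
            # Lowercase the first letter only when a sentence-initial common-noun
            # antecedent is moved to a position that is not sentence-initial.
            if (antecedent['type'] == 'NOMINAL'
                    and antecedent['startIndex'] == 1
                    and mention['startIndex'] > 1):
                replacement = replacement[0].lower() + replacement[1:]
            # sentNum and startIndex are 1-based
            tokens = resolved['sentences'][mention['sentNum'] - 1]['tokens']
            tokens[mention['startIndex'] - 1]['word'] = replacement
    return resolved


# Hand-built stand-in for the CoreNLP output of "The big cat ate its dinner."
sample = {
    'corefs': {
        '1': [
            {'type': 'NOMINAL', 'text': 'The big cat', 'sentNum': 1, 'startIndex': 1},
            {'type': 'PRONOMINAL', 'text': 'its', 'sentNum': 1, 'startIndex': 5},
        ],
    },
    'sentences': [
        {'tokens': [{'word': w}
                    for w in ['The', 'big', 'cat', 'ate', 'its', 'dinner', '.']]},
    ],
}

resolved = resolve_copy(sample)
print(resolved['sentences'][0]['tokens'][4]['word'])  # the big cat
print(sample['sentences'][0]['tokens'][4]['word'])    # its
```

Because the original tokens are untouched, the unresolved and resolved versions can be printed or compared side by side.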