python nlp spacy named-entity-recognition dependency-parsing

Named Entity Recognition in aspect-opinion extraction using dependency rule matching


Using spaCy, I extract aspect-opinion pairs from a text based on grammar rules that I defined. The rules are based on POS tags and dependency tags, which are obtained from token.pos_ and token.dep_. Below is an example of one of the grammar rules. If I pass the sentence "Japan is cool", it returns [('Japan', 'cool', 0.3182)], where the value is the polarity of "cool".

However, I don't know how to make it recognise named entities. For example, if I pass "Air France is cool", I want to get [('Air France', 'cool', 0.3182)], but what I currently get is [('France', 'cool', 0.3182)].

I checked the spaCy online documentation and I know how to extract named entities (doc.ents), but I want to know what workaround would make my extractor use them. Please note that I don't want a forced measure such as concatenating strings (AirFrance, Air_France, etc.).

Thank you!

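import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

nlp = spacy.load("en_core_web_lg")  # I have version 2.2.5 of the model installed
sid = SentimentIntensityAnalyzer()  # VADER scorer used for the polarity value below
review_body = "Air France is cool."
doc = nlp(review_body)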

rule3_pairs = []

for token in doc:

    children = token.children
    A = "999999"
    M = "999999"
    add_neg_pfx = False

    for child in children :
        if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
            A = child.text

        if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
            M = child.text

        # example - 'this could have been better' -> (this, not better)
        if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
            neg_prefix = "not"
            add_neg_pfx = True

        if(child.dep_ == "neg"): # neg is negation
            neg_prefix = child.text
            add_neg_pfx = True

    if (add_neg_pfx and M != "999999"):
        M = neg_prefix + " " + M

    if(A != "999999" and M != "999999"):
        rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))

Result

rule3_pairs
>>> [('France', 'cool', 0.3182)]

Desired output

rule3_pairs
>>> [('Air France', 'cool', 0.3182)]

Solution

  • It's very easy to integrate entities into your extractor. For each nsubj child (the "A" candidate), check whether it is the root token of some named entity; if it is, use the whole entity text as the aspect instead of the single token.

    Here is the complete code:

    # Run once to fetch the model (in a notebook, prefix the command with "!"):
    #   python -m spacy download en_core_web_lg
    import nltk
    nltk.download('vader_lexicon')
    
    import spacy
    nlp = spacy.load("en_core_web_lg")
    
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sid = SentimentIntensityAnalyzer()
    
    
    def find_sentiment(doc):
        # map the character offset (token.idx) of each entity's root token to that entity
        ner_heads = {ent.root.idx: ent for ent in doc.ents}
        rule3_pairs = []
        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            add_neg_pfx = False
            for child in children:
                if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
                    if child.idx in ner_heads:
                        A = ner_heads[child.idx].text
                    else:
                        A = child.text
                if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
                    M = child.text
                # example - 'this could have been better' -> (this, not better)
                if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                    neg_prefix = "not"
                    add_neg_pfx = True
                if(child.dep_ == "neg"): # neg is negation
                    neg_prefix = child.text
                    add_neg_pfx = True
            if (add_neg_pfx and M != "999999"):
                M = neg_prefix + " " + M
            if(A != "999999" and M != "999999"):
                rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))
        return rule3_pairs
    
    print(find_sentiment(nlp("Air France is cool.")))
    print(find_sentiment(nlp("I think Gabriel García Márquez is not boring.")))
    print(find_sentiment(nlp("They say Central African Republic is really great. ")))
    

    The output of this code will be what you need:

    [('Air France', 'cool', 0.3182)]
    [('Gabriel García Márquez', 'not boring', 0.2411)]
    [('Central African Republic', 'great', 0.6249)]
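
To see why the lookup works without loading a model, here is a minimal sketch that uses plain stand-in classes instead of spaCy objects. The names Tok, Ent and expand_subject are hypothetical, the parse of "Air France is cool." is written by hand rather than produced by a parser, and the sketch keys the lookup on the token position instead of the character offset used above.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Stand-in for a parsed token (hypothetical, not a spaCy Token)."""
    i: int      # position of the token in the sentence
    text: str
    dep: str    # dependency label, e.g. "nsubj"


@dataclass
class Ent:
    """Stand-in for a named-entity span (hypothetical, not a spaCy Span)."""
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    root: int   # index of the entity's syntactic root token


def expand_subject(tokens, ents):
    """Return the nsubj text, widened to the full entity when the
    subject token is the root of a named entity."""
    # Same idea as ner_heads above: root token -> entity
    ner_heads = {ent.root: ent for ent in ents}
    for tok in tokens:
        if tok.dep == "nsubj":
            ent = ner_heads.get(tok.i)
            if ent is None:
                return tok.text  # ordinary subject, keep the single token
            return " ".join(t.text for t in tokens[ent.start:ent.end])
    return None  # no subject found


# Hand-written parse of "Air France is cool." (an assumption, not model output)
tokens = [
    Tok(0, "Air", "compound"),
    Tok(1, "France", "nsubj"),
    Tok(2, "is", "ROOT"),
    Tok(3, "cool", "acomp"),
    Tok(4, ".", "punct"),
]
ents = [Ent(start=0, end=2, root=1)]

print(expand_subject(tokens, ents))  # Air France
print(expand_subject(tokens, []))    # France
```

The second call shows the original behaviour: with no entity registered for the subject token, only "France" comes back, which is exactly what the rule returned before the entity lookup was added.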
    

    Enjoy!