
NLP: Get opinionated terms that correspond to aspect terms


I want to extract the opinion term that corresponds to an aspect term in a sentence. I have the following code:

import spacy
nlp = spacy.load("en_core_web_lg")

def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ in ["nsubj"] and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ in ["acomp", "advcl"] and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M))

    return rule3_pairs


print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))

Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]

But this captures too little of the content of the text.

I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]

Is there any approach you recommend trying?

I tried the code given above. I also tried another library, stanza, which only produced similar results.


Solution

  • Unfortunately, if your task is to extract all expressive words from the text (all the words that carry sentimental significance), this is not possible with the current state of the art. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is far less obvious, not to mention edge cases where a seemingly positive word like "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, yet a model can normally only assign one of two opposite labels to such words.

    This is why, for the purposes of sentiment analysis, the only reliable way is to build a machine learning model that classifies texts as a whole, which means you should adapt your software to accept the final verdict and process it one way or another.

    Naive Bayes Classifier

    The simplest way to classify text by sentiment is the Naive Bayes classifier algorithm (which, among other things, can classify far more than sentiment); it is implemented in NLTK:

    from nltk import NaiveBayesClassifier, classify

    # The training data is a list of (feature_dict, label) pairs,
    # e.g. ({"terrible": True, "day": True}, "negative").
    train_data = dataset[:7000]
    test_data = dataset[7000:]

    # The train method returns the trained model.
    classifier = NaiveBayesClassifier.train(train_data)

    # To measure accuracy, use the classify.accuracy method:
    print("Accuracy is:", classify.accuracy(classifier, test_data))
    

    In order to make a prediction, we need to pass in a list of words. It is preferable to remove any words that carry no sentimental significance, such as stop words and punctuation, so that they do not confuse the model:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    def clearLexemes(words):
        # Keep only words that are neither stop words nor punctuation.
        stop = set(stopwords.words("english"))
        return [word for word in words
                if word.lower() not in stop and word not in "!?<>:;.&*%^"]

    text = "What a terrible day!"
    tokens = clearLexemes(word_tokenize(text))
    print("Text sentiment is " +
          str(classifier.classify(dict((token, True) for token in tokens))))
    

    The output will be the sentiment of the text.
    Some important notes about the algorithm:

    • requires minimal parameters to train and trains relatively fast;
    • is highly efficient for working with natural language (it is also used for gender identification and named entity recognition);
    • is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" is negative and uses irony to mask a negative attitude behind positive expressions, and the model may not be able to detect this.

    Logistic Regression

    Another related method is to use a logistic regression model from your favourite machine learning framework. In this notebook I used the Amazon food review dataset to measure how quickly model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be the sentiment).

    import numpy as np  # used below to coerce summary values to strings
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import confusion_matrix, classification_report

    # `reviews` is assumed to be the Amazon food review dataframe,
    # e.g. reviews = pd.read_csv("Reviews.csv")

    # Preparing the data
    ys: pd.DataFrame = reviews.head(170536)  # first 30% of the dataframe is test data
    xs: pd.DataFrame = reviews[170536:]      # remaining 70% is training data

    # Training the model
    lr = LogisticRegression(max_iter=1000)
    cv = CountVectorizer(token_pattern=r'\b\w+\b')
    train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
    test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
    lr.fit(train, xs["Score"])

    # Measuring per-class precision (note the (y_true, y_pred) argument order):
    predictions = lr.predict(test)
    labels = ["x1", "x2", "x3", "x4", "x5"]  # score classes 1..5
    report = classification_report(ys["Score"], predictions,
               target_names=labels, output_dict=True)
    precision = [report[label]["precision"] for label in labels]
    print(precision)
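
    As a quick usage check (the input string here is a made-up example), the trained model can then score unseen summaries:

    sample = cv.transform(["Delicious and arrived quickly"])
    print(lr.predict(sample))        # predicted score label, e.g. 5
    print(lr.predict_proba(sample))  # probability per score class 1..5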
    

    Conclusion

    Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound to its limitations. It is a powerful topic that belongs in the classical NLP suite. Unfortunately, understanding meaning well enough to extract situational sentiment is currently a feat close to inventing Artificial General Intelligence; however, technology is rapidly moving in that direction.