Tags: nltk, python-3.7, sentiment-analysis, predict, naivebayes

How to predict sentiments after training and testing a model with NLTK's NaiveBayesClassifier in Python?


I am doing sentiment classification using NLTK's NaiveBayesClassifier. I trained and tested the model with labeled data. Now I want to predict the sentiment of data that is not labeled, but I run into an error. The line that raises the error is:

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))

The error is:

ValueError: not enough values to unpack (expected 2, got 1)

Below is the code:

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
new_data = pd.read_csv("Japan Data.csv", header=0)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

TRAINING_COUNT = 350


def clean_text(text):
    text = text.replace("<br />", " ")

    return text


analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([(word_tokenize(unidecode(clean_text(instance))))
                                 for instance in train_x[:TRAINING_COUNT]])
print("Vocabulary: ", len(vocabulary))

print("Computing Unigram Features ...")

unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)

print("Unigram Features: ", len(unigram_features))

analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# Build the training set
_train_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                    for instance in train_x[:TRAINING_COUNT]], labeled=False)

# Build the test set
_test_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                   for instance in test_x], labeled=False)

trainer = NaiveBayesClassifier.train
classifier = analyzer.train(trainer, zip(_train_X, train_y[:TRAINING_COUNT]))

score = analyzer.evaluate(list(zip(_test_X, test_y)))
print("Accuracy: ", score['Accuracy'])

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
print(score_1)

I understand that the problem arises because the line that raises the error needs two values per item (the data and its label), but I don't know how to provide them for unlabeled data.

Thanks in advance.


Solution

  • Documentation and example

    The line that gives you the error calls the method SentimentAnalyzer.evaluate(...). The documentation describes this method as follows.

    Evaluate and print classifier performance on the test set.

    See SentimentAnalyzer.evaluate.

    The method has one mandatory parameter: test_set.

    test_set – A list of (tokens, label) tuples to use as gold set.

    In the example at http://www.nltk.org/howto/sentiment.html test_set has the following structure:

    [({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]
    

    Here is a symbolic representation of the structure.

    [(dictionary,label), ... , (dictionary,label)]
    

    Error in your code

    You are passing

    list(zip(new_data['Articles']))
    

    to SentimentAnalyzer.evaluate. I assume you're getting the error because

    list(zip(new_data['Articles']))
    

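    does not create a list of (tokens, label) tuples. Calling zip with a single iterable yields 1-tuples of the form (article,), so each item cannot be unpacked into two values, which matches "expected 2, got 1". You can check this by storing the list in a variable and printing it, or by inspecting the variable in a debugger. For example: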

    test_set = list(zip(new_data['Articles']))
    print("begin test_set")
    print(test_set)
    print("end test_set")
    

    You are calling evaluate correctly three lines above the line that raises the error:

    score = analyzer.evaluate(list(zip(_test_X, test_y)))
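
The difference between the working call and the failing one can be reproduced without NLTK or pandas. The following is a minimal sketch with made-up stand-in values (the lists features, labels and articles, and the helper can_unpack_pairs, are hypothetical and not part of your code). It only illustrates the shape of the items that zip produces.

```python
# Stand-in values; in the question these would come from
# _test_X, test_y and new_data['Articles'].
features = [{"contains(good)": True}, {"contains(good)": False}]
labels = ["pos", "neg"]
articles = ["first unlabeled article", "second unlabeled article"]

labeled = list(zip(features, labels))  # 2-tuples: (features, label)
unlabeled = list(zip(articles))        # 1-tuples: (text,)


def can_unpack_pairs(items):
    """Return True if every item unpacks into exactly two values."""
    try:
        for feats, label in items:
            pass
    except ValueError:
        # Raised when an item does not hold exactly two values.
        return False
    return True


print(labeled[0])
print(unlabeled[0])
print(can_unpack_pairs(labeled), can_unpack_pairs(unlabeled))
```

An evaluation needs a gold label for every item, so unlabeled data cannot be given the first shape without inventing labels.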
    

    To predict labels for unlabeled data, I guess you want to call SentimentAnalyzer.classify(instance) instead. It classifies a single tokenized instance and needs no label. See SentimentAnalyzer.classify.
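
As a rough sketch of that suggestion, the helper below calls classify once per text. The function name predict_sentiments and its tokenize parameter are hypothetical names introduced for illustration. It assumes that the analyzer has already been trained and that classify accepts a list of tokens, as the NLTK documentation describes.

```python
def predict_sentiments(analyzer, texts, tokenize=str.split):
    """Return one predicted label per text.

    analyzer: a trained object exposing classify(tokens), for example the
              question's SentimentAnalyzer after analyzer.train(...).
    tokenize: callable that turns one text into a list of tokens; it should
              match the preprocessing applied to the training data.
    """
    # No labels are involved here, so nothing has to be unpacked.
    return [analyzer.classify(tokenize(text)) for text in texts]
```

With the objects from the question, the call might look like predict_sentiments(analyzer, new_data['Articles'], lambda t: word_tokenize(unidecode(clean_text(t)))). The default str.split tokenizer is only a placeholder. Because the new data has no gold labels, there is nothing for evaluate to score, so the result is a list of predicted labels rather than an accuracy value.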