This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple a model, but it is just a first step for me, and I will try tokenized sentences next time.
The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.
INPUT:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')
def word_feat(word):
    return dict([(word, True)])
# Note: the function 'word_feat(word)' defined here is different from the 'word_feats(words)' function I defined earlier. This one is used to iterate over each of the three elements in the list ['awesome movie', ' i like it', ' it is so bad'].
for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()
OUTPUT:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence instead of each word or letter, I ran some diagnostic code to see what each element passed to 'word_feat(word)' looks like:
for word in words:
    print(word_feat(word))
And it printed out:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So it seems like the function 'word_feat(word)' is correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.
Here is the modified code for you:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')  # this is actually a list of sentences
for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
print()
I modified the part where you were passing a 'list of words' as input to your classifier. Actually, you need to classify the sentences one by one, which means you need to pass a 'list of sentences'.
Also, for each sentence, you need to pass 'words as features', which means you need to split the sentence on the whitespace character.
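As a rough illustration of the difference (reusing the word_feat and word_feats helpers from the question), here is what the classifier actually sees in each case:

# Original approach: the whole sentence is one feature key that never
# appeared during training, so the classifier has essentially nothing to go on.
print(word_feat(' it is so bad'))
# {' it is so bad': True}

# Modified approach: every word is its own feature key, so 'bad' matches
# the feature that came from negative_vocab during training.
print(word_feats('it is so bad'.split(' ')))
# {'it': True, 'is': True, 'so': True, 'bad': True}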
Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to "stop-words" like "it", "they", "is", etc., as these words are not sufficient to decide whether a sentence is positive, negative, or neutral.
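For example, assuming the NLTK stopwords corpus has been downloaded (via nltk.download('stopwords')), the filtering step keeps only the sentiment-bearing word:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered = [w for w in 'it is so bad'.split(' ') if w not in stop_words]
print(filtered)
# expected: ['bad'] -- 'it', 'is' and 'so' are all in the English stop-word list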
The modified code gives the output below:
awesome movie --> pos
i like it --> pos
it is so bad --> neg
So, for any classifier, the input format for training and prediction should be the same. While training you provided a list of words, so try to use the same method to convert your test set as well.
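As a minimal sketch of that point (reusing word_feats and the vocabulary lists defined above), the same feature extractor is applied on both sides:

# Training: every vocabulary list is turned into features with word_feats.
train_set = [
    (word_feats(positive_vocab), 'pos'),
    (word_feats(negative_vocab), 'neg'),
    (word_feats(neutral_vocab), 'neu'),
]
classifier = NaiveBayesClassifier.train(train_set)

# Prediction: the test sentence is split into words and passed through the
# same word_feats helper, so the 'bad' feature matches what was seen in training.
result = classifier.classify(word_feats('it is so bad'.split(' ')))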