python nltk sentiment-analysis naivebayes

Iterate Naive Bayes classifier over a list of strings


This is an NLP question that hopefully someone can help me with. Specifically, I'm trying to do sentiment analysis.

I have a Naive Bayes classifier that has been trained on the well-known data set of tweets that are labeled as either positive or negative:

import random
from nltk import NaiveBayesClassifier

#convert tokens to a dictionary for NB classifier:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)
    
pos_model_tokens = get_tweets_for_model(pos_clean_token)
neg_model_tokens = get_tweets_for_model(neg_clean_token)

#prepare training data
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in pos_model_tokens]
negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in neg_model_tokens]

dataset = positive_dataset + negative_dataset

#shuffle so all positive tweets aren't first
random.shuffle(dataset) 

#set apart 7000 for training, 3000 for testing
train_data = dataset[:7000]  
test_data = dataset[7000:]

#train model
classifier = NaiveBayesClassifier.train(train_data)

Using this model, I want to iterate through a list of test data and keep a tally of how many messages get classified as positive and how many as negative. The test data is a list of strings taken from a data set of text messages.

print(messages[-5:])
>>>["I'm outside, waiting.", 'Have a great day :) See you soon!', "I'll be at work so I can't make it, sry!", 'Are you doing anything this weekend?', 'Thanks for dropping that stuff off :)']

I can get the classification of a single message:

print(classifier.classify(dict([message, True] for message in messages[65])))
>>>Positive

I can return the boolean value of a classification being negative or positive:

neg = (classifier.classify(dict([message, True] for message in messages[65])) == "Negative")

That message is positive, so neg is set to False. I want to iterate over all the messages in the list, incrementing the positive counter when a message is classified as positive and the negative counter when it is classified as negative. But my attempts either increase the positive counter by 1 only, or increase only the positive counter on every pass through the loop, even though the classifier does return "Negative" for individual messages. Here's what I tried:

positive_tally = 0
negative_tally = 0

#increments positive_tally by 1
if (classifier.classify(dict([message, True] for message in messages)) == "Positive") == True:
    positive_tally += 1
else:
    negative_tally += 1

#increments positive_tally by 3749 (length of messages list)
for token in tokens:
    if (classifier.classify(dict([message, True] for message in messages)) == "Positive") == True:
        positive_tally += 1
    else:
        negative_tally += 1

Any ideas on this one? I'd really appreciate it. I can provide more info if needed.
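
For anyone reproducing the problem, it may help to print the feature dict that each expression actually hands to the classifier. The sketch below uses two made-up messages in place of the real messages list, and str.split as a stand-in tokenizer; both are assumptions for illustration only.

```python
# Hypothetical sample data standing in for the real `messages` list.
messages = ["Have a great day :)", "I can't make it, sry!"]

# Expression from the attempts above: the generator walks the whole list,
# so every complete message becomes one feature key.
whole_list_features = dict([message, True] for message in messages)

# Iterating over a single string walks it character by character.
char_features = dict([token, True] for token in messages[0])

# Splitting first gives word-level keys (str.split is a stand-in tokenizer).
word_features = dict([token, True] for token in messages[0].split())

print(whole_list_features)
print(sorted(char_features))
print(word_features)
```

Because the first expression never depends on the loop variable, wrapping it in a loop produces the same classification on every pass, which would explain a tally that only ever grows on one side.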


Solution

  • Okay I got it, posting for posterity in case anyone else gets stuck on a similar problem.

    Basically the classifier takes the features of a single string (a dict built from its words) and evaluates them to make one classification. But I wanted to iterate over a list of strings. So instead of what I had been trying...

    #didn't get what I wanted
    for message in messages:
        if (classifier.classify(dict([message, True] for message in messages))) == "Positive":
            positive_tally += 1
        else: negative_tally += 1
    

    ...which rebuilds the feature dict from the entire list on every pass, so each whole message string is treated as a single token and the result never changes, I had to ensure that it was checking each word within each message:

    #works and increases tally as desired!
    for message in messages:
        if classifier.classify(dict([token, True] for token in message)) == "Positive":
            positive_tally += 1
        else:
            negative_tally += 1
    

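    So you go from list level to string level in for message in messages, and then from string level to token level inside the call to the classifier: dict([token, True] for token in message). Note that if each message is still a raw string rather than a list of tokens, for token in message walks it character by character, so the messages need to be tokenized first to get word-level features.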
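
The same loop can be sketched as a small helper that tokenizes each message before building its feature dict and counts the labels with a Counter. The names to_features, tally_labels and SmileyClassifier are hypothetical and not part of NLTK. The stand-in classifier exists only so the sketch runs without a trained model, and str.split is an assumed placeholder for whatever tokenizing and cleaning the training data went through (for example nltk.tokenize.word_tokenize).

```python
from collections import Counter


def to_features(tokens):
    """Build the {token: True} feature dict used for training above."""
    return {token: True for token in tokens}


def tally_labels(classifier, messages, tokenize=str.split):
    """Classify each message once and count how often each label comes back."""
    tally = Counter()
    for message in messages:          # list level -> string level
        tokens = tokenize(message)    # string level -> word level
        tally[classifier.classify(to_features(tokens))] += 1
    return tally


class SmileyClassifier:
    """Hypothetical stand-in for a trained model; it only looks for ':)'."""

    def classify(self, features):
        return "Positive" if ":)" in features else "Negative"


# Made-up sample messages, not taken from the real data set.
sample_messages = ["Have a great day :)", "I can't make it, sry!"]
tally = tally_labels(SmileyClassifier(), sample_messages)
print(tally["Positive"], tally["Negative"])
```

With the real model, the trained classifier would be passed in place of SmileyClassifier(). The tally is only meaningful if the tokens are produced the same way as the ones the model was trained on.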