python pandas nlp nltk text-classification

Python NLTK and Pandas text classifier (newbie): importing my data in a format similar to the provided example


I'm new to text classification, but I understand most of the concepts. In short, I have a list of restaurant reviews in an Excel dataset and I want to use them as my training data. Where I'm struggling is the syntax for importing both the review text and its classification (1 = pos, 0 = neg) into my training dataset. I understand how to do this if I create the dataset manually as a list of tuples (i.e., what I currently have commented out under train). Any help is appreciated.

import nltk
from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.read_excel("reviewclasses.xlsx")

# I want this to be the review text in "train" below (e.g., "this is a negative review")
customerreview = df.customerreview.tolist()

# I also want this to be the label in "train" below (e.g., 0)
reviewrating = df.reviewrating.tolist()

#train = [("Great place to be when you are in Bangalore.", "1"),
#  ("The place was being renovated when I visited so the seating was limited.", "0"),
#  ("Loved the ambiance, loved the food", "1"),
#  ("The food is delicious but not over the top.", "0"),
#  ("Service - Little slow, probably because too many people.", "0"),
#  ("The place is not easy to locate", "0"),
#  ("Mushroom fried rice was spicy", "1"),
#]

dictionary = set(word.lower() for passage in train for word in word_tokenize(passage[0]))

# Lowercase each review before tokenizing so its tokens match the lowercased dictionary
t = [({word: (word in word_tokenize(x[0].lower())) for word in dictionary}, x[1]) for x in train]

# Step 4 – the classifier is trained with sample data
classifier = nltk.NaiveBayesClassifier.train(t)

test_data = "The food sucked and I couldn't wait to leave the terrible restaurant."
test_data_features = {word.lower(): (word in word_tokenize(test_data.lower())) for word in dictionary}

print(classifier.classify(test_data_features))

Solution

  • I figured it out. I just needed to combine the two lists into a single list of (review, rating) tuples.

    def merge(customerreview, reviewrating):

        merged_list = [(customerreview[i], reviewrating[i]) for i in range(len(customerreview))]
        return merged_list

    train = merge(customerreview, reviewrating)
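
As a possible alternative to the index-based merge, Python's built-in zip pairs two lists element by element. The sketch below uses two short hard-coded lists as hypothetical stand-ins for df.customerreview.tolist() and df.reviewrating.tolist(), so it runs without the Excel file. The str() conversion is an assumption: it only matters if you want string labels like the "1" and "0" in the commented-out train list, since ratings read from Excel will most likely come back as numbers.

```python
# Hypothetical stand-ins for df.customerreview.tolist() and
# df.reviewrating.tolist(); the real values would come from the Excel file.
customerreview = [
    "Loved the ambiance, loved the food",
    "The place is not easy to locate",
]
reviewrating = [1, 0]

# zip walks both lists in parallel and yields one (review, rating) pair
# per position. str() is optional: it turns numeric ratings into string
# labels so they match the hand-written examples in the question.
train = [
    (review, str(rating))
    for review, rating in zip(customerreview, reviewrating)
]

print(train)
```

One difference to be aware of: zip silently stops at the end of the shorter list, while the index-based merge raises an IndexError if reviewrating is shorter than customerreview. With a DataFrame, the two columns can also be passed to zip directly, e.g. zip(df["customerreview"], df["reviewrating"]).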