ValueError: too many values to unpack (NLTK classifier)

I'm running a classification analysis with NLTK's Naive Bayes classifier. My input is a TSV file containing records and labels.

But training fails with an error. Here's my Python code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('tweets.txt', delimiter ='\t', quoting = 3)

dataset.isnull().any()

dataset = dataset.fillna(method='ffill')

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0,16004):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in 
    set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
random_state = 0)
train_set, test_set = X_train[500:], y_train[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

The error is:

File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:

ValueError: too many values to unpack

Solution

NLTK's NaiveBayesClassifier doesn't work like scikit-learn estimators. Its train() method takes a single argument that carries both the features (X) and the labels (y).

But your code supplies only a slice of X_train, so train() tries to unpack a (featureset, label) pair from each row of counts, hence the error.
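The failure can be reproduced without NLTK at all. This is a minimal sketch, assuming a made-up four-column row in place of a real row from X_train; the unpacking line mirrors the `for featureset, label in labeled_featuresets` statement shown in the traceback.

```python
# One row of a count matrix: several numbers, no label attached.
# (Illustrative values, not taken from the real dataset.)
row = [0, 1, 0, 2]

try:
    # Same two-name unpacking that train() applies to every item it receives.
    featureset, label = row
except ValueError as err:
    message = str(err)
    print(message)

# A (feature dictionary, label) pair unpacks cleanly into two names.
featureset, label = ({"good": 1, "bad": 0}, "positive")
print(featureset, label)
```

Any item with more than two elements fails the same way, which is why a row of 10000 counts triggers the error.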

NaiveBayesClassifier expects a list of tuples: one tuple per training sample, each holding that sample's feature dictionary and its label. Something like:

X = [({feature1:'val11', feature2:'val12' .... }, class1),
     ({feature1:'val21', feature2:'val22' .... }, class2), 
     ...
     ...                                                  ]

You need to convert your input to this format:

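# On scikit-learn 1.0 and later, use cv.get_feature_names_out() instead
feature_names = cv.get_feature_names()
# X and y are the full dataset; use X_train and y_train to hold out a test set
train_set = []
for i, single_sample in enumerate(X):
    # Map each vocabulary term to its count in this sample
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    # Pair the feature dictionary with the sample's label
    train_set.append((single_feature_dict, y[i]))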

Note: The above for loop can be shortened by using a dict comprehension, but I'm not that fluent there.
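For reference, a dict-comprehension version of the same conversion might look like the sketch below. The small `feature_names`, `X`, and `y` values are stand-ins invented for illustration; in the real script they would come from the CountVectorizer and the dataset.

```python
# Stand-in data (assumed for illustration only).
feature_names = ["good", "bad", "movie"]
X = [[1, 0, 1], [0, 2, 1]]
y = ["pos", "neg"]

# One (feature dictionary, label) tuple per sample.
train_set = [
    ({name: value for name, value in zip(feature_names, row)}, label)
    for row, label in zip(X, y)
]

print(train_set[0])
```

Using zip() avoids the index bookkeeping of the nested enumerate() loops while producing the same list of tuples.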

Then you can do this:

classifier = nltk.NaiveBayesClassifier.train(train_set)
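If you also want a held-out accuracy figure, the list of tuples has to be split as a whole, so features and labels stay paired. The following is only a sketch: `split_featuresets` and `train_and_score` are hypothetical helper names, and the 20% test fraction simply mirrors the `test_size` in the question.

```python
import random


def split_featuresets(labeled, test_fraction=0.2, seed=0):
    """Shuffle a copy of the (features, label) tuples and split it in two."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Returns (training part, test part)
    return shuffled[n_test:], shuffled[:n_test]


def train_and_score(labeled):
    """Train on one part and report accuracy on the other."""
    import nltk  # imported here so split_featuresets works without NLTK

    train_part, test_part = split_featuresets(labeled)
    classifier = nltk.NaiveBayesClassifier.train(train_part)
    return classifier, nltk.classify.accuracy(classifier, test_part)
```

With the `train_set` list built earlier, this would be called as `classifier, acc = train_and_score(train_set)`.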