I am trying to implement an online classifier using the 'passive aggressive classifier' in scikit-learn with the 20 newsgroups dataset. I am very new to this, so I am not sure if I have implemented it properly. That being said, I developed a small piece of code, but when I execute it I keep getting the error:
Traceback (most recent call last):
  File "/home/suleka/Documents/RNN models/passiveagressive.py", line 100, in <module>
    clf.fit(X, y)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/passive_aggressive.py", line 225, in fit
    coef_init=coef_init, intercept_init=intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 444, in _fit
    classes, sample_weight, coef_init, intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 407, in _partial_fit
    raise ValueError("The number of class labels must be "
ValueError: The number of class labels must be greater than one.
I checked most of the posts on Stack Overflow, and they suggested that this error appears when there is only one unique class in the labels. So I did np.unique(labels)
and it showed all 20 classes (20 newsgroups):
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
Can anyone help me out with this error, and please let me know if I have implemented it wrong?
My code is shown below:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.datasets import make_classification
from string import punctuation
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from collections import Counter
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

nltk.download('stopwords')

seed = 42
np.random.seed(seed)


def preProcess():
    newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features = vectorizer.fit_transform(newsgroups_data.data)
    labels = newsgroups_data.target
    return features, labels


if __name__ == '__main__':
    features, labels = preProcess()

    X_train, y_train = shuffle(features, labels, random_state=seed)

    clf = PassiveAggressiveClassifier(random_state=seed)

    n, d = X_train.shape

    print(np.unique(labels))

    error = 0
    iteration = 0

    for i in range(n):
        print(iteration)
        X, y = X_train[i:i + 1], y_train[i:i + 1]
        clf.fit(X, y)
        pred = clf.predict(X)
        print(pred)
        print(y)
        if y - pred != 0:
            error += 1
        iteration += iteration

    print(error)
    print(np.divide(error, n, dtype=np.float))
Thank you in advance!
The issue lies in this line:
X, y = X_train[i:i + 1], y_train[i:i + 1]
which is inside your for loop, i.e. after you have asked for np.unique(labels) and comfortably found that you do indeed have all 20 labels...
Looking closely, you will realize that this line results in an X and a y of only one element each (X_train[i] and y_train[i], respectively - in fact, since the error arguably happens in the very first iteration, for i=0, you end up with only X_train[0] and y_train[0]), which of course should not be the case when fitting a model; hence, the error message correctly points out that you have only one label in your set (because you have only one sample, that is)...
To convince yourself that this is indeed the case, just insert a print(np.unique(y)) before your clf.fit() - it will print only one label.
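A minimal sketch of that check, using the same variable names as in your code:

for i in range(n):
    X, y = X_train[i:i + 1], y_train[i:i + 1]
    print(np.unique(y))  # prints a single label, e.g. [7]
    clf.fit(X, y)        # fails, since only one class is present in y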
It is quite unclear what exactly you are trying to achieve with your for loop; if you are trying to train your classifier on successive pieces of your dataset, you could try changing the [i:i+1] indices to [i:i+k] for some large enough k, but for a 20-label dataset this is not so simple, as you have to ensure that all 20 labels will be present for each call to clf.fit(), otherwise you will end up comparing apples to oranges...
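Just to make that concrete, here is a rough sketch of the [i:i+k] idea (the chunk size k = 500 is an arbitrary assumption, not a recommendation):

k = 500  # arbitrary chunk size, for illustration only
for i in range(0, n, k):
    X, y = X_train[i:i + k], y_train[i:i + k]
    # clf.fit() re-trains from scratch on this chunk alone, so unless
    # all 20 labels appear in every chunk, successive models will have
    # been trained on different sets of classes
    print(np.unique(y))
    clf.fit(X, y)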
I strongly suggest starting simple: remove the for loop, fit your classifier on the whole of your training set (clf.fit(X_train, y_train)), and check the scikit-learn documentation for the available performance metrics you can use...
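A rough sketch of that batch setup (the train/test split and the accuracy metric are just one reasonable choice here, not the only one):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out part of the data so evaluation is not done on training samples
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=seed)

clf = PassiveAggressiveClassifier(random_state=seed)
clf.fit(X_train, y_train)  # one call, whole training set, all 20 labels present

pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))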
EDIT: I just noticed this detail in your question:
I am trying to implement an online classifier
Well, what you are trying to do is certainly not online training (which is a huge topic by itself), as your for loop simply retrains (or tries to, at least) a new classifier from scratch during each iteration.
As I already said, start simple; try to firmly grasp the principles of simple batch training first, before moving to the much more advanced topic of online training, which is definitely not a beginner's one...