Tags: python, pandas, scikit-learn, nlp, valueerror

ValueError: Found input variables with inconsistent numbers of samples on binary SVM


I'm trying to run a binary SVM on the 20_newsgroups dataset, but I keep getting ValueError: Found input variables with inconsistent numbers of samples: [783, 1177]. Can anyone suggest why this is happening?

from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
# from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
categories = ["comp.graphics", 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

def is_letter_only(word):
    return word.isalpha()

all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

def clean_text(docs):
    docs_cleaned = []
    for doc in docs:
        doc = doc.lower()
        doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
                for word in doc.split() if is_letter_only(word)
                and word not in all_names)
        docs_cleaned.append(doc_cleaned)
    return docs_cleaned

cleaned_train = clean_text(data_train.data)
label_train = data_train.target
cleaned_test = clean_text(data_train.data)
label_test = data_test.target
len(label_train),len(label_test)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=None)
term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)

from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=42)

svm.fit(term_docs_train, label_train)

accuracy = svm.score(term_docs_test, label_test)
print(accuracy)


Solution

  • That error tells you there is a mismatch between the number of samples you are predicting labels for and the number of labels you are scoring against. It happens because cleaned_test is built from the training documents (clean_text(data_train.data) is called for both sets), so the test feature matrix has 1177 rows while label_test, which comes from the actual test set, has only 783 entries.

    Just fix this line:

    cleaned_test = clean_text(data_test.data)
    

    and the result for your script is:

    0.966794380587484
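
    A quick way to catch this kind of mismatch before fitting or scoring is to check that the feature matrix and the label vector agree in length. A minimal sketch, reusing the variable names from the question and assuming the corrected line above:

    # build the test features from the *test* documents
    cleaned_test = clean_text(data_test.data)
    term_docs_test = tfidf_vectorizer.transform(cleaned_test)

    # sanity check: one row of features per test label
    assert term_docs_test.shape[0] == len(label_test), (
        f"{term_docs_test.shape[0]} samples vs {len(label_test)} labels"
    )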