python · machine-learning · text · nlp · svm

I got ValueError: X has 5851 features per sample; expecting 2754 when applying Linear SVC model to test set


I'm trying to classify texts using LinearSVC, but I get an error.

I built Tf-idf features, oversampled the training set, and applied the model to the test set as shown below.

#Imports (hero = texthero; RandomOverSampler is from imbalanced-learn)
import pandas as pd
import texthero as hero
from texthero import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import svm
from imblearn.over_sampling import RandomOverSampler

#Import datasets
train = pd.read_csv('train_labeled.csv')
test = pd.read_csv('test.csv')

#Clean datasets
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_urls,
                   preprocessing.remove_digits,
                   preprocessing.stem
                   ]

train["clean_text"] = train["text"].pipe(hero.clean, custom_pipeline)
test["clean_text"] = test["text"].pipe(hero.clean, custom_pipeline)

#Create Tfidf

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train["clean_text"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.fit_transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)

#Oversampling of training set
over = RandomOverSampler(sampling_strategy='minority')

X_os, y_os = over.fit_resample(X_train_tfidf, train["label"])

#Model
clf = svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True, tol=1e-3)
clf.fit(X_os, y_os)

pred = clf.predict(X_test_tfidf)

This raises the error below. I think it's because the test set ends up with 5851 features, while the training set has only 2754.

ValueError: X has 5851 features per sample; expecting 2754

In this case, what am I supposed to do?


Solution

  • Do not call fit_transform() on the test data: the transformers would learn a new vocabulary from the test set and transform it differently from the training set. To reuse the vocabulary learned from the training data, call only transform() on the test data:

    # initialize transformers
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    
    # fit and transform train data
    X_train_counts = count_vect.fit_transform(train["clean_text"])
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    
    # transform test data
    X_test_counts = count_vect.transform(test["clean_text"])
    X_test_tfidf = tfidf_transformer.transform(X_test_counts)
    

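To see why this fixes the shape mismatch, here is a minimal runnable demonstration with toy documents (placeholders, not the original CSV data): transform() reuses the vocabulary learned by fit_transform(), so train and test matrices get the same number of columns, and words unseen during training are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["red apple", "green apple pie"]
test_docs = ["blue apple", "pie crust"]  # contains words unseen in training

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# Fit on training data only, then reuse the learned vocabulary on the test data.
X_train = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))
X_test = tfidf_transformer.transform(count_vect.transform(test_docs))

# Both matrices share the training vocabulary, so the feature counts match.
print(X_train.shape[1] == X_test.shape[1])  # True
```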
    Note

    If you don't need the intermediate count matrix from CountVectorizer, you can use TfidfVectorizer, which combines both steps and shortens the code:

    tfidf_vect = TfidfVectorizer()
    
    X_train_tfidf = tfidf_vect.fit_transform(train["clean_text"])
    X_test_tfidf = tfidf_vect.transform(test["clean_text"])
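Putting it together, here is a hedged end-to-end sketch of the corrected flow with TfidfVectorizer and LinearSVC; the texts and labels are toy placeholders standing in for the real CSV columns, and the oversampling step is indicated in a comment so the sketch stays self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

# Toy placeholder data standing in for train["clean_text"] / train["label"].
train_texts = ["good movie", "great film", "bad plot", "awful acting"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film", "bad acting"]

tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train_texts)  # fit on training data only
X_test_tfidf = tfidf_vect.transform(test_texts)        # reuse the learned vocabulary

# The RandomOverSampler step from the question would slot in here unchanged:
# X_os, y_os = over.fit_resample(X_train_tfidf, train_labels)

clf = svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True, tol=1e-3)
clf.fit(X_train_tfidf, train_labels)
pred = clf.predict(X_test_tfidf)  # no shape mismatch: feature counts match
print(len(pred))
```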