Tags: python, python-3.x, machine-learning, scikit-learn, text-classification

ValueError: cannot use sparse input in 'SVC' trained on dense data


I'm trying to run my classifier, but I get the error above. Here is my code:

import pandas
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = pd.read_csv('all_topics_limpo.csv', encoding = 'utf-8')
data = pandas.get_dummies(dataset['verbatim_corrige'])
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2, random_state = 0)

count_vector = CountVectorizer()
tfidf = TfidfTransformer()

classifier = OneVsOneClassifier(SVC(kernel = 'linear', random_state = 100))
#classifier = LogisticRegression()

train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(X_train, y_train)

test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)

predicted = classifier.predict(X_test)

print("confusion matrix")
print(confusion_matrix(y_test, predicted, labels = labels))

print("F-score")
print(f1_score(y_test, predicted))
print(precision_score(y_test, predicted))
print(recall_score(y_test, predicted)) 

print("cross validation")
test_counts = count_vector.fit_transform(data)
test_tfidf = tfidf.fit_transform(test_counts)

scores = cross_validation.cross_val_score(classifier, test_tfidf, labels, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))

My output error:

ValueError: cannot use sparse input in 'SVC' trained on dense data

I cannot run my code because of this error, and I do not understand what is happening.

Full traceback:

Traceback (most recent call last):
  File "classification.py", line 42, in <module>
    predicted = classifier.predict(test_tfidf)
  File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 584, in predict
    Y = self.decision_function(X)
  File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in decision_function
    for est, Xi in zip(self.estimators_, Xs)]).T
  File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in <listcomp>
    for est, Xi in zip(self.estimators_, Xs)]).T
  File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 548, in predict
    y = super(BaseSVC, self).predict(X)
  File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 308, in predict
    X = self._validate_for_predict(X)
  File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 448, in _validate_for_predict
    % type(self).__name__)
ValueError: cannot use sparse input in 'SVC' trained on dense data


Solution

  • You get this error because your training and test data are not of the same kind: while you train on your initial X_train set, which is dense:

    classifier.fit(X_train, y_train)
    

    you are trying to get predictions from a dataset that has first undergone count vectorization and tf-idf transformation, which makes it a sparse matrix:

    predicted = classifier.predict(test_tfidf)
    

    It is unclear why you do this, why you compute train_counts and train_tfidf without using them anywhere, and why you then redefine predicted as classifier.predict(X_test) immediately afterwards. Normally, changing your training line to

    classifier.fit(train_tfidf, y_train)
    

    and removing your second definition of predicted should work.
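
The fix described above can be sketched on a small, made-up corpus. The texts and labels below are hypothetical stand-ins for dataset['verbatim_corrige'] and dataset['label'], and the sketch assumes the raw strings are passed straight to CountVectorizer rather than through pandas.get_dummies. It is only an illustration of fitting and predicting on the same sparse tf-idf representation, not a drop-in replacement for the original script. The last few lines fit a second SVC on a dense copy of the training data to show where the mismatch comes from.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the real CSV columns.
train_texts = [
    "cheap flights to paris", "book a hotel in rome",
    "football match tonight", "tennis final results",
    "new phone released", "laptop battery review",
]
train_labels = ["travel", "travel", "sport", "sport", "tech", "tech"]
test_texts = ["hotel in paris", "football results", "phone battery"]

count_vector = CountVectorizer()
tfidf = TfidfTransformer()

# Learn the vocabulary and tf-idf weights from the training text only.
train_tfidf = tfidf.fit_transform(count_vector.fit_transform(train_texts))
# Reuse the fitted transformers on the test text (transform, not fit_transform).
test_tfidf = tfidf.transform(count_vector.transform(test_texts))

classifier = OneVsOneClassifier(SVC(kernel="linear", random_state=100))
# Train on the same sparse tf-idf representation that predict() will receive.
classifier.fit(train_tfidf, train_labels)
predicted = classifier.predict(test_tfidf)
print(predicted)

# For contrast: an SVC fitted on a dense array rejects sparse input at
# prediction time, which is the situation in the question.
dense_clf = SVC(kernel="linear").fit(train_tfidf.toarray(), train_labels)
try:
    dense_clf.predict(test_tfidf)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
print(mismatch_message)
```

The point is that whatever representation goes into fit() must also go into predict(). Here both are the sparse output of the same fitted CountVectorizer and TfidfTransformer.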