python scikit-learn vectorization text-classification countvectorizer

Text Classification Using Python

I have a list of short texts in the text variable, each with a label. I would like to build a classifier that can predict the label of new input text.

I am thinking of using the scikit-learn package in Python with an SVM model.

I realize that the text needs to be converted to vector form, so I am trying TfidfVectorizer and CountVectorizer.

This is my code so far using TfidfVectorizer:

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

label = ['organisasi','organisasi','organisasi','organisasi','organisasi','lokasi','lokasi','lokasi','lokasi','lokasi']
text = ['Partai Anamat Nasional','Persatuan Sepak Bola', 'Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']

vectorizer = TfidfVectorizer(min_df=1)

X = vectorizer.fit_transform(text)
y = label

klasifikasi = svm.SVC()

klasifikasi = klasifikasi.fit(X,y) #training

test_text = ['Partai Perjuangan']
test_vector = vectorizer.fit_transform(test_text)

prediksi = klasifikasi.predict([test_vector]) #test

print(prediksi)

I also tried CountVectorizer with the same code as above. Both give the same error:

ValueError: setting an array element with a sequence.

How can I solve this problem? Thanks.

Solution

The error is due to this line:

prediksi = klasifikasi.predict([test_vector])

Most scikit-learn estimators require a 2-D array of shape [n_samples, n_features]. The test_vector returned by TfidfVectorizer is a sparse matrix that already has that shape, so it can be passed to the estimator directly. You don't need to wrap it in square brackets ([ and ]). The wrapping turns it into a list containing a matrix, which is not valid input.
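As a quick check of the shape claim above, here is a minimal sketch that prints the shape of a vectorizer's output. The two training strings are taken from the question; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on two of the training texts from the question.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(['Partai Anamat Nasional', 'Candi Borobudur'])

# Vectorize one new text with the already-fitted vectorizer.
test_vector = vectorizer.transform(['Partai Perjuangan'])

# The result is already 2-D: one row per sample, one column per vocabulary term.
print(X.shape)
print(test_vector.shape)
```

The row count equals the number of input strings and the column count equals the size of the learned vocabulary, which is the layout predict() expects.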

Try using it like this:

prediksi = klasifikasi.predict(test_vector)

But even then you will get an error, because of this line:

test_vector = vectorizer.fit_transform(test_text)

Here you are refitting the vectorizer on the test text, so it learns a new vocabulary and produces features that differ from the ones the klasifikasi estimator was trained on. fit_transform() is just a shortcut for calling fit() (learning from the data) and then transform(). For test data, always use the transform() method, never fit() or fit_transform().
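To make the difference concrete, the following sketch compares the number of features produced by transform() on a fitted vectorizer with the number produced by fitting a fresh vectorizer on the test text. It uses a shortened training list and a second vectorizer (named refit here purely for illustration) so that the trained one is left untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola',
              'Malioboro', 'Candi Borobudur']
test_text = ['Partai Perjuangan']

# Learn the vocabulary from the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
n_train_features = X_train.shape[1]

# transform() reuses the training vocabulary, so the column count matches.
X_test_ok = vectorizer.transform(test_text)

# Fitting on the test text learns a different, smaller vocabulary.
refit = TfidfVectorizer()
X_test_bad = refit.fit_transform(test_text)

print(n_train_features, X_test_ok.shape[1], X_test_bad.shape[1])
```

The classifier was trained on n_train_features columns, so only X_test_ok is compatible with it. Words that never appeared in training (here "Perjuangan") are simply ignored by transform().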

So the correct code will be:

test_vector = vectorizer.transform(test_text)
prediksi = klasifikasi.predict(test_vector)

#Output: array(['organisasi'],  dtype='|S10')
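One way to avoid this class of mistake altogether is to bundle the vectorizer and the classifier into a scikit-learn Pipeline. This is an alternative arrangement, not part of the original code above: the pipeline calls fit_transform() on the vectorizer during fit() and only transform() during predict(), so the test text can be passed in as raw strings. A sketch using the data from the question (the name model is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

label = ['organisasi', 'organisasi', 'organisasi', 'organisasi', 'organisasi',
         'lokasi', 'lokasi', 'lokasi', 'lokasi', 'lokasi']
text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola', 'Himpunan Mahasiswa',
        'Organisasi Sosial', 'Masyarakat Peduli', 'Malioboro', 'Candi Borobudur',
        'Taman Pintar', 'Museum Sejarah', 'Monumen Mandala']

# Vectorizer and classifier chained into a single estimator.
model = make_pipeline(TfidfVectorizer(), SVC())

# fit() fits the vectorizer on the training text, then trains the SVM.
model.fit(text, label)

# predict() only transforms the new text with the fitted vocabulary.
prediksi = model.predict(['Partai Perjuangan'])
print(prediksi)
```

With this layout there is no separate test_vector to build by hand, so there is no opportunity to call fit_transform() on the test data.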