I have a list of words in a text variable, along with their labels. I would like to build a classifier that can predict the label of new input text.
I am thinking of using the scikit-learn package in Python with an SVM model.
I realize that the text needs to be converted to vector form, so I am trying TfidfVectorizer and CountVectorizer.
This is my code so far, using TfidfVectorizer:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi','organisasi','organisasi','organisasi','organisasi','lokasi','lokasi','lokasi','lokasi','lokasi']
text = ['Partai Anamat Nasional','Persatuan Sepak Bola', 'Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)
y = label
klasifikasi = svm.SVC()
klasifikasi = klasifikasi.fit(X,y) #training
test_text = ['Partai Perjuangan']
test_vector = vectorizer.fit_transform(test_text)
prediksi = klasifikasi.predict([test_vector]) #test
print(prediksi)
I also tried CountVectorizer with the same code as above. Both give the same error:
ValueError: setting an array element with a sequence.
How can I solve this problem? Thanks.
The error is due to this line:
prediksi = klasifikasi.predict([test_vector])
Most scikit-learn estimators require an array of shape [n_samples, n_features]. The test_vector output from TfidfVectorizer is already in that shape, ready to be used by estimators. You don't need to wrap it in square brackets ([ and ]). The wrapping turns it into a list, which is unsuitable.
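To see the shape for yourself, here is a minimal sketch. The four-item text list is just a shortened sample of the data from the question, and the variable name wrapped is made up for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened sample of the question's data (illustration only)
text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola', 'Malioboro', 'Candi Borobudur']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)

# The vectorizer output is already a 2-D sparse matrix: (n_samples, n_features)
print(issparse(X), X.shape)

# Wrapping it in brackets gives a plain Python list holding one matrix,
# which is no longer a 2-D array-like that an estimator can use
wrapped = [X]
print(type(wrapped), len(wrapped))
```

Each row is one document and each column is one word of the learned vocabulary, so the matrix can be passed to fit() or predict() directly.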
Try using it like this:
prediksi = klasifikasi.predict(test_vector)
But even then you will get an error, because of this line:
test_vector = vectorizer.fit_transform(test_text)
Here you are refitting the vectorizer on the test text, so it learns a different vocabulary (and therefore a different number of features) than the one the klasifikasi estimator was trained on. fit_transform() is just a shortcut for calling fit() (learning from the data) and then transform() on it. For test data, always use the transform() method, never fit() or fit_transform().
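The mismatch is easy to check by comparing feature counts. This is only a sketch: the name refit_vector is made up here, and the refit is done on a second, throwaway vectorizer so that the trained one is left untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola', 'Himpunan Mahasiswa',
        'Organisasi Sosial', 'Masyarakat Peduli', 'Malioboro', 'Candi Borobudur',
        'Taman Pintar', 'Museum Sejarah', 'Monumen Mandala']
test_text = ['Partai Perjuangan']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)  # learns the training vocabulary

# Fitting on the test text alone learns a new, tiny vocabulary
refit_vector = TfidfVectorizer().fit_transform(test_text)

# transform() reuses the training vocabulary; unseen words are ignored
test_vector = vectorizer.transform(test_text)

print(X.shape[1])             # number of features the classifier was trained on
print(refit_vector.shape[1])  # number of features after refitting on the test text
print(test_vector.shape[1])   # matches the training feature count
```

Only the vector produced by transform() has the same number of columns as the training matrix, which is what predict() expects.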
So the correct code will be:
test_vector = vectorizer.transform(test_text)
prediksi = klasifikasi.predict(test_vector)
#Output: array(['organisasi'], dtype='|S10')
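As an alternative that is not part of the code above, you could put the vectorizer and the classifier into a scikit-learn Pipeline, so that predict() only ever calls transform() on new text. A minimal sketch, assuming default settings for both steps (the name model is made up here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

label = ['organisasi', 'organisasi', 'organisasi', 'organisasi', 'organisasi',
         'lokasi', 'lokasi', 'lokasi', 'lokasi', 'lokasi']
text = ['Partai Anamat Nasional', 'Persatuan Sepak Bola', 'Himpunan Mahasiswa',
        'Organisasi Sosial', 'Masyarakat Peduli', 'Malioboro', 'Candi Borobudur',
        'Taman Pintar', 'Museum Sejarah', 'Monumen Mandala']

# fit() fits the vectorizer and then the classifier on the raw strings
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(text, label)

# predict() transforms the new text with the already fitted vectorizer
prediksi = model.predict(['Partai Perjuangan'])
print(prediksi)
```

With this setup you pass raw strings to both fit() and predict(), so there is no separate vectorizer to refit by accident.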