Tags: machine-learning, svm, python-3.7, sentiment-analysis, sklearn-pandas

How to load unlabelled data for sentiment classification after training SVM model?


I am doing sentiment classification with an sklearn SVM model. I trained the model on labeled data and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And after classifying the unlabeled data, how can I see whether each item was classified as positive or negative?

I am using Python 3.7. The code is below.

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)

train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics


clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])

clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))

When I run this code, I get the output:

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
Accuracy :  0.8977272727272727
Precision :  0.8604651162790697
Recall :  0.925

What is the meaning of ConvergenceWarning?

Thanks in Advance!


Solution

  • What is the meaning of ConvergenceWarning?

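    As Pavel already mentioned, ConvergenceWarning means that max_iter was hit: the solver stopped at its iteration limit before it converged. You can raise the limit with LinearSVC(max_iter=...), or suppress the warning as described in: How to disable ConvergenceWarning using sklearn?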
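
The effect of max_iter can be checked with a small sketch. The toy texts and labels below are made up for illustration (they are not the asker's data), and fit_and_count_warnings is a hypothetical helper. It records any ConvergenceWarning raised during fitting, so you can see whether a larger max_iter is enough.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); 1 = positive, 0 = negative.
texts = ["good great film", "awful boring film", "great acting", "boring plot"]
labels = [1, 0, 1, 0]


def fit_and_count_warnings(max_iter):
    """Fit the pipeline and return it with the number of ConvergenceWarnings."""
    clf = Pipeline([
        ("vectorizer", CountVectorizer()),
        ("classifier", LinearSVC(max_iter=max_iter)),
    ])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, do not print it
        clf.fit(texts, labels)
    n = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return clf, n


# A generous iteration limit; the default for LinearSVC is 1000.
clf, n_warnings = fit_and_count_warnings(max_iter=10000)
print("ConvergenceWarnings raised:", n_warnings)
```

If the warning persists on real data even with a large max_iter, scaling or normalizing the features (for example with TfidfVectorizer instead of raw counts) is a common thing to try; whether it helps here is untested.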

    Now I want to use the model to predict the sentiment of unlabeled data. How can I do that?

    You do it with the same command: pred_y = clf.predict(test_x). The only things you change are the name pred_y (your free choice) and test_x, which should be your new, unseen data. It has to be in the same form as train_x and test_x (the raw article text), so that the vectorizer in the pipeline maps it to the same features.

    In your case as you are doing:

    sentiment_data = list(zip(data['Articles'], data['Sentiment']))
    

    you are forming a list of (article, sentiment) tuples, which you then shuffle, and you unzip the first 350 rows:

    train_x, train_y = zip(*sentiment_data[:350])
    

    Here your train_x is the column data['Articles'], so if you have new data, all you have to do is:

    new_data = pd.read_csv("new_data.csv", header=0)
    new_y = clf.predict(new_data['Articles'])
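
As a self-contained sketch of the same idea, the example below uses small in-memory DataFrames in place of the CSV files. The rows, and the "Predicted" column name, are invented for illustration; only the 'Articles' and 'Sentiment' column names come from the question. It stores the predictions next to the text they belong to.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled data standing in for the training CSV.
labeled = pd.DataFrame({
    "Articles": ["profits rise strongly", "shares crash badly",
                 "strong growth reported", "bad losses reported"],
    "Sentiment": [1, 0, 1, 0],
})

# Hypothetical unlabeled data: same 'Articles' column, no 'Sentiment' column.
new_data = pd.DataFrame({"Articles": ["strong profits", "shares crash"]})

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
clf.fit(labeled["Articles"], labeled["Sentiment"])

# The pipeline vectorizes the raw text itself, so pass the text column directly.
new_data["Predicted"] = clf.predict(new_data["Articles"])
print(new_data)
```

Because the vectorizer is part of the pipeline, words in the new articles that were never seen during training are simply ignored rather than causing a feature-count mismatch.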
    

    how to see whether it is classified as positive or negative?

    You can then inspect pred_y (new_y in the snippet above): each entry will be either a 1 or a 0. Normally 0 should be negative, but that depends on how your dataset is set up.
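
To make the output easier to read, the numeric predictions can be mapped to names. This is a minimal sketch: the new_y values are hypothetical, and the 0 = negative, 1 = positive encoding in label_names is an assumption that you should check against your own labeled data.

```python
import pandas as pd

# Hypothetical predictions, shaped like the output of clf.predict(...).
new_y = [1, 0, 1, 1]

# Assumed encoding; verify it against the 'Sentiment' column of your dataset.
label_names = {0: "negative", 1: "positive"}

named = pd.Series(new_y).map(label_names)
print(named.tolist())
print(named.value_counts().to_dict())
```

On a fitted classifier, clf.classes_ lists the label values the model learned, which is a quick way to confirm which values can appear in the predictions.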