Tags: scikit-learn, text-classification, valueerror

calibrated classifier ValueError: could not convert string to float


Dataframe:

id    review                                              name         label
1     it is a great product for turning lights on.        Ashley       
2     plays music and have a good sound.                  Alex        
3     I love it, lots of fun.                             Peter        

I want to use a probabilistic classifier (a calibrated LinearSVC) to predict, from the review text, the probability that the label is 1. My code:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets

#Load  dataset
X = training['review']
y = training['label']

linear_svc = LinearSVC()     # The base estimator

# Wrapping the base estimator in CalibratedClassifierCV adds predict_proba
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt scaling; see the documentation for other methods
                                        cv=3)
calibrated_svc.fit(X, y)


# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)

It gives the following error on calibrated_svc.fit(X, y):

ValueError: could not convert string to float: 'it is a great product for turning...'

I would appreciate your help.


Solution

  • SVM models cannot handle text data directly. LinearSVC expects a numeric feature matrix, which is why fit fails with "could not convert string to float" when it receives raw review strings. You need to extract numeric features from the text first. It is worth reading up on NLP feature-extraction techniques such as bag of words and TF-IDF. For your example, a minimal working pipeline would be:

    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Load dataset
    X = training['review']
    y = training['label']
    
    # TfidfVectorizer turns each review into a numeric TF-IDF vector before LinearSVC sees it
    linear_svc = make_pipeline(TfidfVectorizer(), LinearSVC())
    
    # Wrapping the pipeline in CalibratedClassifierCV adds predict_proba
    calibrated_svc = CalibratedClassifierCV(linear_svc,
                                            method='sigmoid',
                                            cv=3) 
    calibrated_svc.fit(X, y)
    
    
    # predict
    prediction_data = predict_data['review']
    predicted_probs = calibrated_svc.predict_proba(prediction_data)
    

    You probably also want to clean the text a bit first, for example by removing special characters, lowercasing, and stemming. Take a look at spaCy, a library for text processing.
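
As a rough illustration of the cleaning step, here is a minimal sketch using only the standard library. The function name clean_text is hypothetical, and it covers lowercasing and special-character removal only; stemming would need an NLP library and is not shown.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    # Anything that is not an ASCII letter, digit, or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("I LOVE it!!!  Lots of fun :)"))  # prints "i love it lots of fun"
```

A function like this can be passed to the vectorizer as TfidfVectorizer(preprocessor=clean_text). Note that TfidfVectorizer already lowercases by default, and that supplying a preprocessor replaces its built-in lowercasing and accent stripping. The pattern above also drops non-ASCII letters, so it would need adjusting for reviews that are not in plain English.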