Tags: python, python-3.x, machine-learning, scikit-learn, logistic-regression

Error on probability calibration in logistic regression: ValueError: could not convert string to float: 'OLIFE'


I have built a logistic regression model and did the pre-processing with a scikit-learn pipeline. Training and testing worked fine, but when I try to calibrate the model on validation data I get an error at calib_clf.fit(Valid, labelValid):

ValueError: could not convert string to float: 'OLIFE'

Here is my code:

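from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encode the categorical columns; standardize all remaining columns
column_trans = make_column_transformer(
    (OneHotEncoder(), ['PRODUCT_LINE_ID', 'SMOKING_STATUS', 'gender', 'Cover_Type']),
    remainder=StandardScaler()
)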

column_trans.fit_transform(train)

# Create a pipeline that preprocesses the data, then trains a logistic regression classifier
logreg = LogisticRegression()
model_pipeline = make_pipeline(column_trans, logreg)

# Fitting the model pipeline
model_pipeline.fit(train,labelTrain)

# Testing the model pipeline on new data/test data
predictions = model_pipeline.predict_proba(test)[:,1]


calib_clf = CalibratedClassifierCV(model_pipeline, method="sigmoid", cv="prefit")
calib_clf.fit(Valid, labelValid)

Solution

  • https://github.com/dnishimoto/python-deep-learning/blob/master/Happiness%20and%20Depression%20Logistic%20Regression.ipynb

    I used a happiness vs. depression data set. For the number of cross-validation folds I used cv=3 instead of cv="prefit".

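      # The estimator is passed positionally because the keyword was renamed
      # from base_estimator to estimator in newer scikit-learn releases.
      calibrated_clf = CalibratedClassifierCV(pipeline['clf'], method="sigmoid", cv=3)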
      calibrated_clf.fit(X, y)
    
      print(calibrated_clf.predict_proba(X)[:5, :])
    

    Output (each row gives the predicted probabilities of the two classes):

       [[0.97151521 0.02848479]
       [0.9953179  0.0046821 ]
       [0.01829911 0.98170089]
       [0.99405208 0.00594792]
       [0.82948843 0.17051157]]
    

    The probability output suggests that the data do not behave consistently in all cases. The cause could be the moderate depression category, which I combined to create a binary (binomial) target. Moderate depression probably needs to be stratified separately, and a new target variable needs to be defined.
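For the original question, here is a minimal sketch of the same idea applied to data with string columns. It rests on an assumption: that the error comes from CalibratedClassifierCV checking the raw input as numeric before the pipeline's encoder sees it, which I believe is how older scikit-learn releases behave. The sketch encodes the data first and then calibrates only the classifier with cv=3, so the calibrator never receives strings. The toy frame, the "age" column, the "TERM" category and the lowercase variable names are all made up for illustration; only the column names PRODUCT_LINE_ID and SMOKING_STATUS and the value 'OLIFE' come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the asker's train/Valid frames.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "PRODUCT_LINE_ID": rng.choice(["OLIFE", "TERM"], size=n),
    "SMOKING_STATUS": rng.choice(["Y", "N"], size=n),
    "age": rng.normal(40, 10, size=n),
})
y = (X["age"] + rng.normal(0, 5, size=n) > 40).astype(int).to_numpy()

train, valid = X.iloc[:120], X.iloc[120:]
label_train = y[:120]

# Same preprocessing idea as in the question: one-hot encode the
# categorical columns and standardize everything else.
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["PRODUCT_LINE_ID", "SMOKING_STATUS"]),
    remainder=StandardScaler(),
)

# Encode first, so that only numeric matrices reach the calibrator.
train_num = column_trans.fit_transform(train)
valid_num = column_trans.transform(valid)

# With cv=3 the calibrator fits and calibrates clones of the classifier
# on internal folds, so no separately prefit model is needed.
calib_clf = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=3)
calib_clf.fit(train_num, label_train)

# Calibrated class probabilities for the held-out rows.
probs = calib_clf.predict_proba(valid_num)
print(probs[:5, :])
```

One trade-off of this sketch is that the column transformer is fitted once on all training rows rather than inside each calibration fold. Newer scikit-learn versions may accept the whole pipeline together with the raw data frame directly, which avoids that, but I have not checked exactly which release changed this.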