Multivariate logistic regression in Python shows error

I'm trying to make predictions with logistic regression and to test their accuracy using Python and the sklearn library. I'm using data that I downloaded from here:

It's an Excel file. I wrote some code, but I always get the same error:

ValueError: Unknown label type: 'continuous'

I used the same logic when I built a linear regression, and it works there.

This is the code:

import numpy as np
import pandas as pd
import xlrd
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Reading data from excel

data = pd.read_excel("DataSet.xls").round(2)
data_size = data.shape[0]
#print("Number of data:",data_size,"\n",data.head())

my_data = data[(data["Superpl"] == 0) & (data["FlyAsh"] == 0) & (data["BlastFurSlag"] == 0)].drop(columns=["Superpl","FlyAsh","BlastFurSlag"])
my_data = my_data[my_data["Days"]<=28]
my_data_size = my_data.shape[0]
#print("Size of dataset for 28 days or less:", my_data_size, "\n", my_data.head())

def logistic_regression(data_input, cement, water,
                          coarse_aggr, fine_aggr, days):

    variable_list = []
    result_list = []

    for column in data_input:
        variable_list.append(column)
        result_list.append(column)

    variable_list = variable_list[:-1]
    result_list = result_list[-1]

    variables = data_input[variable_list]
    results = data_input[result_list]

    #accuracy of prediction (splitting dataframe in train and test)
    var_train, var_test, res_train, res_test = train_test_split(variables, results, test_size = 0.3, random_state = 42)

    #making logistic model and fitting the data into logistic model
    log_regression = linear_model.LogisticRegression()
    model = log_regression.fit(var_train, res_train)

    input_values = [cement, water, coarse_aggr, fine_aggr, days]

    #predicting the outcome based on the input_values
    predicted_strength = log_regression.predict([input_values]) #adding values for prediction
    predicted_strength = round(predicted_strength[0], 2)

    #calculating accuracy score
    score = log_regression.score(var_test, res_test)
    score = round(score*100, 2)

    prediction_info = "\nPrediction of future strength: " + str(predicted_strength) + " MPa\n"
    accuracy_info = "Accuracy of prediction: " + str(score) + "%\n"
    full_info = prediction_info + accuracy_info

    return full_info

print(logistic_regression(my_data, 376.0, 214.6, 1003.5, 762.4, 3)) #true value after 3 days: 16.28 MPa


  • Although you don't provide details of your data, judging from the error and the comment in the last line of your code:

    #true value after 3 days: 16.28 MPa

    I conclude that you are in a regression (i.e., numeric prediction) setting. Linear regression is an appropriate model for this task, but logistic regression is not: logistic regression is for classification problems, so it expects binary (or categorical) target values, not continuous ones. That mismatch is what raises the error.

    In short, you are trying to apply a model that is inappropriate for your problem.
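
A minimal sketch of why the target is rejected, using `type_of_target` from `sklearn.utils.multiclass`. The strength values and the 30 MPa cut-off are made up for illustration and are not taken from the dataset:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical strength values in MPa (illustrative only, not from the dataset)
strength = np.array([16.28, 25.10, 33.40, 41.05])

# Float targets with non-integer values are categorized as 'continuous',
# which is the label type named in the ValueError
target_kind = type_of_target(strength)
print(target_kind)

# Thresholding at an arbitrary cut-off (an assumption made for this example)
# produces class labels that a classifier can accept
is_strong = (strength >= 30.0).astype(int)
label_kind = type_of_target(is_strong)
print(label_kind)
```

Binning the target like this turns the task into a different problem (predicting a class rather than a strength in MPa), so it only makes sense if a class is what you actually want to predict.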

    UPDATE (after link to the data): Indeed, if you read the dataset description closely, you'll see (emphasis added):

    The concrete compressive strength is the regression problem

    while the scikit-learn User's Guide says this about logistic regression (again, emphasis added):

    Logistic regression, despite its name, is a linear model for classification rather than regression.
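
For a continuous target like this one, a regressor is the natural fit. Below is a rough sketch with `LinearRegression`, assuming five numeric inputs in the order cement, water, coarse aggregate, fine aggregate, days. The data is synthetic and the coefficients are invented, so the printed numbers say nothing about the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five input columns; NOT the real dataset
rng = np.random.default_rng(42)
X = rng.uniform(
    low=[100, 120, 800, 600, 1],
    high=[540, 250, 1150, 1000, 28],
    size=(200, 5),
)
# Invented linear relationship plus noise, used only to have something to fit
y = 0.08 * X[:, 0] - 0.1 * X[:, 1] + 0.5 * X[:, 4] + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# For regressors, score() returns R^2, not classification accuracy
r2 = regressor.score(X_test, y_test)
prediction = regressor.predict([[376.0, 214.6, 1003.5, 762.4, 3]])[0]

print(f"R^2 on held-out data: {r2:.3f}")
print(f"Predicted strength: {prediction:.2f} MPa")
```

Note that `score` on a regressor reports the coefficient of determination (R^2), so it should not be labeled as a percentage "accuracy" the way the original function does.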