
Fitting a LogisticRegression model on test data gives the same score as fitting on train data


I fit a LogisticRegression model on the train data, checked the score on the test data, and got

test_score 0.802083

Afterwards, out of curiosity, I fit the model on the test data, checked the score on the test data, and somehow got the same test score.

Why?

I am using the Pima Indians diabetes data:

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?select=diabetes.csv

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# load the Kaggle diabetes CSV
df = pd.read_csv('diabetes.csv')

diab_cols = ['Pregnancies', 'Insulin', 'BMI', 'Glucose', 'BloodPressure', 'DiabetesPedigreeFunction']
X = df[diab_cols]  # features
y = df.Outcome     # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.25,
    random_state=0)

# one model fit on the train split, one fit on the test split
model = LogisticRegression().fit(X_train, y_train)
model_test = LogisticRegression().fit(X_test, y_test)

print("test_score", model.score(X_test, y_test))
print("test_score", model_test.score(X_test, y_test))

Solution

  • It looks like your test data is a "perfect" representation of the training set.

    One possibility is that the models trained on the training set and on the test set ended up with similar weights. You can compare the weights of the two LogisticRegression models:

    e.g.,

    print(model.coef_[0])
    print(model_test.coef_[0])
    

    If the weights of the two models differ, a second possibility is that every test point lies on the same side of the decision threshold (default 0.5 on the predicted probability, i.e. 0 on the decision function) under both models, so both models classify every test point identically and the scores come out the same. You can check the per-sample confidence of each model on the test set by calling the decision_function() method.

    e.g.,

    model.decision_function(X_test)
    model_test.decision_function(X_test)
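
    The check above can be put together in a runnable sketch. Since the Kaggle CSV may not be at hand, this uses a synthetic dataset of the same shape via `make_classification` (an assumption, not the asker's data), and compares the sign of the two models' decision functions on the test set:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # synthetic stand-in for the diabetes data: 768 rows, 6 features
    X, y = make_classification(n_samples=768, n_features=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)     # fit on train
    model_test = LogisticRegression().fit(X_test, y_test)  # fit on test

    # a test point gets the same label from both models exactly when both
    # decision_function values have the same sign (class 1 iff value > 0)
    same_side = (np.sign(model.decision_function(X_test))
                 == np.sign(model_test.decision_function(X_test)))
    print("fraction classified identically:", same_side.mean())

    # if every point is on the same side in both models, the two
    # accuracy scores on X_test must match exactly
    if same_side.all():
        assert model.score(X_test, y_test) == model_test.score(X_test, y_test)
    ```

    If the printed fraction is 1.0, identical scores are guaranteed even though the coefficient vectors differ.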