Tags: python, machine-learning, scikit-learn, logistic-regression

Inverse of prediction is correct in Scikit Learn Logistic Regression


In the following minimal reproducible example, I split a dataset into train and test sets, fit a logistic regression to the training set with scikit-learn, and predict y based on x_test.

However, the y predictions (y_pred) are only correct if inverted (i.e. 0 → 1 and 1 → 0), calculated like so: 1 - y_pred. Why is this the case? I can't figure out whether it is something related to the scaling of x (I have tried with and without the StandardScaler), something related to the logistic regression, or the accuracy score calculation.

In my larger dataset this is also the case, even when using different seeds for the random state. I have also tried this Logistic Regression with the same result.

EDIT: As pointed out by @Nester, it works without the standard scaler for this minimal dataset. A larger dataset is available here; StandardScaler does nothing on this larger dataset. I'll keep the OP's smaller dataset, as it might help in explaining the problem.

# imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# small dataset
Y = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
X = [[0.38373581], [0.56824121], [0.39078066], [0.41532221], [0.3996311],
     [0.3455455], [0.55867358], [0.51977073], [0.51937625], [0.48718916],
     [0.37019272], [0.49478954], [0.37277804], [0.6108499], [0.39718093],
     [0.33776591], [0.36384773], [0.50663667], [0.3247984]]


x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=42, stratify=Y)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

y_pred = 1 - y_pred #          <- why?

accuracy_score(y_test, y_pred)
1.0
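
To show what I checked: one way to inspect what the fitted model actually learned (a diagnostic sketch using standard scikit-learn attributes, not part of the original snippet) is:

# inspect the fitted LogisticRegression inside the pipeline
lr = clf.named_steps['logisticregression']
print(lr.classes_)                # class order used by coef_ and predict_proba
print(lr.coef_, lr.intercept_)    # coefficient sign shows the learned direction
print(clf.predict_proba(x_test))  # per-class probabilities; columns follow classes_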

Larger dataset accuracy:

accuracy_score(y_test, y_pred)
0.7  # if inverted
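
(For binary 0/1 labels, inverting the predictions simply mirrors the accuracy around 0.5, so 0.7 after inverting corresponds to 0.3 before. A minimal check of that identity, assuming y_pred holds the raw predictions:)

import numpy as np

# for binary 0/1 labels: accuracy(1 - y_pred) == 1 - accuracy(y_pred)
acc = accuracy_score(y_test, y_pred)
acc_flipped = accuracy_score(y_test, 1 - np.asarray(y_pred))
assert np.isclose(acc + acc_flipped, 1.0)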

Thanks for reading.


Solution

  • X and Y do not have any relationship at all, hence the model is performing poorly. There is no reason to say that 1 - y_pred is performing better: with no real signal, whichever of y_pred or 1 - y_pred scores higher is an accident of the split and will not generalize. If you had more than two classes, the situation would be even worse, since there would be no single "inverse" to flip to.

    # plot the (standardized) feature against the labels to see
    # whether X separates the two classes at all
    %matplotlib inline
    import matplotlib.pyplot as plt

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, stratify=Y)
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(x_train, y_train)

    plt.scatter(clf.named_steps['standardscaler'].transform(x_train), y_train)
    plt.scatter(clf.named_steps['standardscaler'].transform(x_test), y_test)
    print(clf.score(x_test, y_test))
    

    [Scatter plot of standardized X against Y: the two classes overlap over the same range of X]
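
    As a sanity check (a minimal sketch, not part of the original answer), you can compare the pipeline against a majority-class baseline; if logistic regression cannot beat it, X carries no usable signal about Y:

    from sklearn.dummy import DummyClassifier

    dummy = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
    dummy.fit(x_train, y_train)
    print('baseline:', dummy.score(x_test, y_test))
    print('model   :', clf.score(x_test, y_test))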

    The relationship is the same for your bigger dataset as well.

    [Same scatter plot for the larger dataset, showing the same overlap]

    Try to identify other features that can help you in predicting Y; a sketch of one way to screen candidates follows.
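
    A minimal sketch for screening candidate features, scoring each one on its own with cross-validated ROC AUC (the candidates dict below is a hypothetical placeholder; substitute your own columns):

    from sklearn.model_selection import cross_val_score

    candidates = {'x_original': X}  # hypothetical: maps feature name -> column of values
    for name, col in candidates.items():
        pipe = make_pipeline(StandardScaler(), LogisticRegression())
        # a mean AUC near 0.5 means the feature alone carries no signal about Y
        auc = cross_val_score(pipe, col, Y, cv=5, scoring='roc_auc')
        print(name, round(auc.mean(), 3))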