python, pandas, machine-learning, scikit-learn, supervised-learning

sklearn Features don't affect accuracy


I have recently dived into machine learning using sklearn. After using it on some data, I noticed that no matter whether I remove or add features, the accuracy does not change (it is stuck at 0.66668208448967). In other words:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


data = pd.read_csv('/Users/fozoro/Downloads/test.csv')

X = data[["x", "y"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

# note: cross_val_score clones and refits the model on each fold,
# so the fit above is not strictly needed for the score
scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")

print(scores.mean())

This code prints 0.66668208448967.

To better illustrate my point, I added a column to my CSV file that consists entirely of zeros (I named the column zeros). After changing X = data[["x","y"]] to X = data[["zeros"]], I ended up with this code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


data = pd.read_csv('/Users/fozoro/Downloads/HQ_Questions_Rest_new_test.csv')

X = data[["zeros"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")

print(scores.mean())

It still prints the same score of 0.66668208448967.

At this point, I'm assuming that it is still using the other two columns x and y, though I fail to understand why. Does anyone know what the problem is?

Thank you very much in advance for your help.

This is a small part of the CSV file:

0   44600  yes
12  41700  no
574 14500  no

When I print(data.dtypes) I get the following:

Q + ans                    int64
Q + ans broken search      int64
Bing total Search          int64
mean1                    float64
mean2                    float64
zeros                      int64
correct                    int64
dtype: object

When I print(data.describe()) I get the following:

          Q + ans  Q + ans broken search  Bing total Search       mean1  \
count  477.000000             477.000000       4.770000e+02  477.000000   
mean     3.972746              30.408805       3.661450e+06    3.972746   
std     12.112970             133.128478       1.555090e+07    7.292793   
min      0.000000               0.000000       0.000000e+00    0.000000   
25%      0.000000               0.000000       8.110000e+04    0.000000   
50%      0.000000               0.000000       3.790000e+05    1.333333   
75%      2.000000               4.000000       2.000000e+06    5.333333   
max    162.000000            1908.000000       2.320000e+08   60.666667   

                mean2  zeros     correct  
count  477.000000  477.0  477.000000
mean    30.272537    0.0    0.333333
std     76.365587    0.0    0.471899
min      0.000000    0.0    0.000000
25%      0.000000    0.0    0.000000
50%      1.666667    0.0    0.000000
75%     21.000000    0.0    1.000000
max    636.666667    0.0    1.000000

Solution

  • Your problem lies in your "correct" column. You provide strings ("yes" and "no") where numbers are expected.

    For example, substitute all "yes" with 1 and all "no" with 0 and then try again.
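
    One way to do that substitution in pandas might be a sketch like the following (it assumes your label column is named "correct", as in your code):

    import pandas as pd

    # your file with the "yes"/"no" labels
    data = pd.read_csv('/Users/fozoro/Downloads/test.csv')

    # map the string labels to integers: "yes" -> 1, "no" -> 0
    data["correct"] = data["correct"].map({"yes": 1, "no": 0})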

    See the following minimal example:

    test.csv:

    x,y,correct,zeros
    1,1,1.0,0
    2,2,0.0,0
    1,2,0.0,0
    3,1,1.0,0
    3,1,1.0,0
    4,2,0.0,0
    5,2,0.0,0
    6,1,1.0,0
    7,1,1.0,0
    8,2,0.0,0
    9,2,0.0,0
    10,1,1.0,0
    11,1,1.0,0
    12,1,1.0,0
    13,1,1.0,0
    14,1,1.0,0
    15,1,1.0,0
    16,1,1.0,0
    

    Content of the Python file:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")
    X = data[["x", "y"]]
    y = data[["correct"]]

    knn = LogisticRegression()
    scores = cross_val_score(knn, X, y.values.ravel(), cv=3, scoring="accuracy")
    print(scores.mean())
    

    Try to replace the line X = data[["x","y"]] with X = data[["zeros"]] and notice the difference!
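
    To see both scores side by side, here is a small sketch (using the same test.csv as above) that loops over the two feature sets:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")
    y = data["correct"]

    # score the informative features and the constant column with the same model
    for cols in (["x", "y"], ["zeros"]):
        scores = cross_val_score(LogisticRegression(), data[cols], y, cv=3, scoring="accuracy")
        print(cols, scores.mean())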

    From the statistics of your data we can learn that 318 of the 477 samples in your data set belong to the 0 (or "no") group; that is 2/3, or 0.666.... If your model cannot learn anything from the provided features, the coefficients of those features stay at zero, and the intercept simply encodes that class 0 is the more frequent one. Hence, for any input, the predicted class will be 0 (or "no"). This is why you always get the same score: the model always predicts 0, and since 2/3 of your data belongs to the zero class, it is right in 66% of the cases.
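
    You can check this directly on the fitted model (a sketch, run here on the example test.csv): with only the constant "zeros" column as input, the learned coefficient stays at (approximately) zero and every prediction is the majority class.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    data = pd.read_csv("./test.csv")
    X = data[["zeros"]]

    model = LogisticRegression()
    model.fit(X, data["correct"])

    print(model.coef_)                  # approximately [[0.]]: a constant feature carries no signal
    print(np.unique(model.predict(X)))  # only the majority class appears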

    With my provided data samples you can see that it DOES make a difference whether we use the "x" and "y" columns or the "zeros" column. In the first case we get a score of over 72%. If we use only the meaningless "zeros" column, we get 66%, purely because of the class distribution of the data set.
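
    Scikit-learn's DummyClassifier makes this baseline explicit: it ignores the features entirely and always predicts the most frequent class, so its cross-validated accuracy is simply the majority-class share of the labels. A sketch, again on the example test.csv:

    import pandas as pd
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")

    # a features-blind baseline: always predict the most frequent class
    baseline = DummyClassifier(strategy="most_frequent")
    scores = cross_val_score(baseline, data[["x", "y"]], data["correct"], cv=3, scoring="accuracy")
    print(scores.mean())  # roughly the majority-class share of the labels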