python, pandas, machine-learning, scikit-learn, supervised-learning

sklearn Features don't affect accuracy


I have recently dived into machine learning using sklearn. After using it on some data, I noticed that no matter whether I remove or add features, the accuracy does not change (it is stuck at 0.66668208448967). In other words:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


data = pd.read_csv('/Users/fozoro/Downloads/test.csv')

X = data[["x", "y"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

# note: cross_val_score clones and refits the model on each fold,
# so the fit above is not strictly needed for the score
scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")

print(scores.mean())

This code prints 0.66668208448967.

To better illustrate my point, I added a column to my CSV file that consists entirely of zeros (I named the column zeros). After changing X = data[["x","y"]] to X = data[["zeros"]], I ended up with this code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


data = pd.read_csv('/Users/fozoro/Downloads/HQ_Questions_Rest_new_test.csv')

X = data[["zeros"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")

print(scores.mean())

It still prints the same score of 0.66668208448967.

At this point, I'm assuming that it is still using the other two columns x and y, though I fail to understand why. Does anyone know what the problem is?

Thank you very much in advance for your help.

This is a small part of the CSV file:

0   44600  yes
12  41700  no
574 14500  no

When I print(data.dtypes) I get the following:

Q + ans                    int64
Q + ans broken search      int64
Bing total Search          int64
mean1                    float64
mean2                    float64
zeros                      int64
correct                    int64
dtype: object

When I print(data.describe()) I get the following:

          Q + ans  Q + ans broken search  Bing total Search       mean1  \
count  477.000000             477.000000       4.770000e+02  477.000000   
mean     3.972746              30.408805       3.661450e+06    3.972746   
std     12.112970             133.128478       1.555090e+07    7.292793   
min      0.000000               0.000000       0.000000e+00    0.000000   
25%      0.000000               0.000000       8.110000e+04    0.000000   
50%      0.000000               0.000000       3.790000e+05    1.333333   
75%      2.000000               4.000000       2.000000e+06    5.333333   
max    162.000000            1908.000000       2.320000e+08   60.666667   

                mean2  zeros     correct  
count  477.000000  477.0  477.000000
mean    30.272537    0.0    0.333333
std     76.365587    0.0    0.471899
min      0.000000    0.0    0.000000
25%      0.000000    0.0    0.000000
50%      1.666667    0.0    0.000000
75%     21.000000    0.0    1.000000
max    636.666667    0.0    1.000000

Solution

  • Your problem lies in your "correct" column. You provide strings ("yes" and "no") where numbers are expected.

    For example, substitute all "yes" with 1 and all "no" with 0 and then try again.
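
    One way to do that substitution in pandas might be a sketch like the following (it assumes your label column is named "correct", as in your code):

    import pandas as pd

    # your file with the "yes"/"no" labels
    data = pd.read_csv('/Users/fozoro/Downloads/test.csv')

    # map the string labels to integers: "yes" -> 1, "no" -> 0
    data["correct"] = data["correct"].map({"yes": 1, "no": 0})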

    See the following minimal example:

    test.csv:

    x,y,correct,zeros
    1,1,1.0,0
    2,2,0.0,0
    1,2,0.0,0
    3,1,1.0,0
    3,1,1.0,0
    4,2,0.0,0
    5,2,0.0,0
    6,1,1.0,0
    7,1,1.0,0
    8,2,0.0,0
    9,2,0.0,0
    10,1,1.0,0
    11,1,1.0,0
    12,1,1.0,0
    13,1,1.0,0
    14,1,1.0,0
    15,1,1.0,0
    16,1,1.0,0
    

    Content of the Python file:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")
    X = data[["x", "y"]]
    y = data[["correct"]]

    knn = LogisticRegression()
    scores = cross_val_score(knn, X, y.values.ravel(), cv=3, scoring="accuracy")
    print(scores.mean())
    

    Try to replace the line X = data[["x","y"]] with X = data[["zeros"]] and notice the difference!
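
    To see both scores side by side, here is a small sketch (using the same test.csv as above) that loops over the two feature sets:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")
    y = data["correct"]

    # score the informative features and the constant column with the same model
    for cols in (["x", "y"], ["zeros"]):
        scores = cross_val_score(LogisticRegression(), data[cols], y, cv=3, scoring="accuracy")
        print(cols, scores.mean())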

    From the statistics of your data we can learn that 318 of the 477 samples in your data set belong to the 0 (or "no") group; that is 2/3, or 0.666.... If your model cannot learn anything from the provided features, the coefficients of those features stay at zero, and the intercept simply encodes that class 0 is the more frequent one. Hence, for any input, the predicted class will be 0 (or "no"). This is why you always get the same score: the model always predicts 0, and since 2/3 of your data belongs to the zero class, it is right in 66% of the cases.
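
    You can check this directly on the fitted model (a sketch, run here on the example test.csv): with only the constant "zeros" column as input, the learned coefficient stays at (approximately) zero and every prediction is the majority class.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    data = pd.read_csv("./test.csv")
    X = data[["zeros"]]

    model = LogisticRegression()
    model.fit(X, data["correct"])

    print(model.coef_)                  # approximately [[0.]]: a constant feature carries no signal
    print(np.unique(model.predict(X)))  # only the majority class appears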

    With my provided data samples you can see that it DOES make a difference whether we use the "x" and "y" columns or the "zeros" column. In the first case we get a score of over 72%. If we use only the meaningless "zeros" column, we get 66%, purely because of the class distribution of the data set.
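
    Scikit-learn's DummyClassifier makes this baseline explicit: it ignores the features entirely and always predicts the most frequent class, so its cross-validated accuracy is simply the majority-class share of the labels. A sketch, again on the example test.csv:

    import pandas as pd
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("./test.csv")

    # a features-blind baseline: always predict the most frequent class
    baseline = DummyClassifier(strategy="most_frequent")
    scores = cross_val_score(baseline, data[["x", "y"]], data["correct"], cv=3, scoring="accuracy")
    print(scores.mean())  # roughly the majority-class share of the labels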