I have recently dived into machine learning using sklearn. After trying it on some data, I noticed that no matter whether I remove or add features, the accuracy doesn't change (it is stuck at 0.66668208448967). In other words:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv('/Users/fozoro/Downloads/test.csv')
X = data[["x", "y"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

# 10-fold cross-validated accuracy
scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")
print(scores.mean())
This code prints 0.66668208448967
To better illustrate my point, I added a column to my CSV file that consists entirely of 0s (I named the column zeros).
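For reference, the same column can also be created directly in pandas instead of editing the CSV by hand, e.g.:

import pandas as pd

data = pd.read_csv('/Users/fozoro/Downloads/test.csv')
data["zeros"] = 0  # a constant column that carries no information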
After changing
X = data[["x","y"]]
to
X = data[["zeros"]]
I end up with this code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv('/Users/fozoro/Downloads/HQ_Questions_Rest_new_test.csv')
X = data[["zeros"]]
y = data[["correct"]]

knn = LogisticRegression()
knn.fit(X, y.values.ravel())

# 10-fold cross-validated accuracy
scores = cross_val_score(knn, X, y.values.ravel(), cv=10, scoring="accuracy")
print(scores.mean())
and it still prints the same score of 0.66668208448967.
At this point, I'm assuming that it is still using the other two columns x and y, though I fail to understand why. Does anyone know what the problem is?
Thank you very much in advance for your help.
This is a small part of the CSV file:
0 44600 yes
12 41700 no
574 14500 no
When I print(data.dtypes) I get the following:
Q + ans int64
Q + ans broken search int64
Bing total Search int64
mean1 float64
mean2 float64
zeros int64
correct int64
dtype: object
When I print(data.describe()) I get the following:
          Q + ans  Q + ans broken search  Bing total Search       mean1       mean2  zeros     correct
count  477.000000             477.000000       4.770000e+02  477.000000  477.000000  477.0  477.000000
mean     3.972746              30.408805       3.661450e+06    3.972746   30.272537    0.0    0.333333
std     12.112970             133.128478       1.555090e+07    7.292793   76.365587    0.0    0.471899
min      0.000000               0.000000       0.000000e+00    0.000000    0.000000    0.0    0.000000
25%      0.000000               0.000000       8.110000e+04    0.000000    0.000000    0.0    0.000000
50%      0.000000               0.000000       3.790000e+05    1.333333    1.666667    0.0    0.000000
75%      2.000000               4.000000       2.000000e+06    5.333333   21.000000    0.0    1.000000
max    162.000000            1908.000000       2.320000e+08   60.666667  636.666667    0.0    1.000000
Your problem lies in your "correct" column: you provide strings ("yes" and "no") where numbers are expected.
Substitute all "yes" values with 1 and all "no" values with 0, then try again.
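In pandas this substitution is a one-liner; a minimal sketch, assuming the column contains exactly the strings "yes" and "no":

import pandas as pd

data = pd.read_csv('/Users/fozoro/Downloads/test.csv')
# map the string labels to integers; any other value would become NaN
data["correct"] = data["correct"].map({"yes": 1, "no": 0})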
See the following minimal example:
test.csv:
x,y,correct,zeros
1,1,1.0,0
2,2,0.0,0
1,2,0.0,0
3,1,1.0,0
3,1,1.0,0
4,2,0.0,0
5,2,0.0,0
6,1,1.0,0
7,1,1.0,0
8,2,0.0,0
9,2,0.0,0
10,1,1.0,0
11,1,1.0,0
12,1,1.0,0
13,1,1.0,0
14,1,1.0,0
15,1,1.0,0
16,1,1.0,0
Content of the Python file:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("./test.csv")
X = data[["x", "y"]]
y = data[["correct"]]

knn = LogisticRegression()
scores = cross_val_score(knn, X, y.values.ravel(), cv=3, scoring="accuracy")
print(scores.mean())
Try to replace the line
X = data[["x","y"]]
with X = data[["zeros"]]
and notice the difference!
From the statistics of your data (the mean of "correct" is 0.333333) we can learn that 318 of the 477 samples in your data set belong to the 0 (or "no") class. That is 2/3, or 0.666...
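You can verify this class distribution directly from the target column, for example:

import pandas as pd

data = pd.read_csv('/Users/fozoro/Downloads/test.csv')
# relative class frequencies: 318/477 ≈ 0.667 for class 0
print(data["correct"].value_counts(normalize=True))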
So if your model cannot learn anything from the provided features, their coefficients end up at zero and only the intercept matters, which makes the model fall back on the majority class. Hence, for any input, the predicted class will be 0 (or "no"). This is why you always get the same score: the model always predicts 0, and since 2/3 of your data belong to the zero class, it is right in about 67% of the cases.
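That behaviour is exactly what sklearn's DummyClassifier with the most_frequent strategy implements, so it makes a handy baseline to compare against; a sketch:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv('/Users/fozoro/Downloads/test.csv')
X = data[["x", "y"]]
y = data[["correct"]].values.ravel()

# always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
print(scores.mean())  # ~0.667 for a 2/3 vs. 1/3 class split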
With my provided data samples you can see that it DOES make a difference whether we use the "x" and "y" columns or the "zeros" column. In the first case, we get a score of over 72%. If we use only the meaningless "zeros" column, we get 66%, purely because of the class distribution of the data set.
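If you want to see the fallback behaviour directly, fit the model on the "zeros" column alone and inspect it; with the minimal test.csv from above:

import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("./test.csv")
X = data[["zeros"]]
y = data[["correct"]].values.ravel()

knn = LogisticRegression()
knn.fit(X, y)

print(knn.coef_)       # ~0: a constant feature carries no signal
print(knn.predict(X))  # the same (majority) class for every row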