Tags: python, scikit-learn, dataset, normalization

Perfect accuracy using normalized data on the Pima Indians dataset


I am seeing weird performance from an SVC classifier in sklearn. I decided to use k-fold cross-validation on the Pima Indians dataset. Since I wanted to try an SVC classifier, I normalized the data using MinMaxScaler(feature_range=(0, 1)) to get feature values between 0 and 1. But when I run the model I get 100% accuracy in every fold, which is obviously impossible. I looked for errors in the code but didn't come across anything strange. Here is my code. Any explanation for this behaviour?

PS: I obviously load all needed libraries. I downloaded the dataset from here https://gist.github.com/ktisha/c21e73a1bd1700294ef790c56c8aec1f and parsed it to make things easier later on. Did I miss a step?

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv("pima dataset.txt", names=col_names)
X = pima[col_names].values
y = pima.label.values
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
#check transformations
print(rescaledX[0:5,:])
X_train, X_test, y_train, y_test = train_test_split(rescaledX,y, test_size = 0.2, random_state =42)
from sklearn.svm import SVC
import random
clf_1 = SVC(random_state = 42) #create a default model
clf_1.fit(X_train, y_train) #fitting the model
r_svc = [random.randrange(1, 1000) for i in range(3)] #three random seeds for the simulations
scores_matrix_clf_1 = []
for i in r_svc:
    kf = KFold(n_splits=10, shuffle = True, random_state = i) 
    kf.get_n_splits(X)
    scores = cross_val_score(clf_1, X_train, y_train, cv=kf, n_jobs=-1, scoring = "accuracy")
    print('          SCORES FOR EACH RANDOM THREE SEEDS',i)
    print('-----------------------------SCORES----------------------------------------')
    print(scores, scores.mean())
    scores_matrix_clf_1.append(scores)

The output I am getting is this:

          SCORES FOR EACH RANDOM THREE SEEDS 617
-----------------------------SCORES----------------------------------------
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
          SCORES FOR EACH RANDOM THREE SEEDS 764
-----------------------------SCORES----------------------------------------
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
          SCORES FOR EACH RANDOM THREE SEEDS 395
-----------------------------SCORES----------------------------------------
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0

Solution

  • Your X (input data set) contains the label column that you are trying to predict. This is called data leakage, and it almost always leads to 100% accuracy, because one of the columns (features) you hand the estimator is exactly the answer you want it to predict.

    Example:

    imagine that you have a data set containing the following features:

    • human height
    • human weight
    • human foot size

    and you want to predict sex.

    So if you feed height, weight, foot size and sex to your model as the input data set, and sex (again) as the output vector, the model will learn that the last feature, sex, deserves the highest coefficient (weight), because it always "predicts" the correct sex.
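    You can reproduce the effect without the Pima file at all. The sketch below uses a synthetic binary dataset from make_classification (an assumption standing in for the real data) and compares cross-validated accuracy with and without the label column appended to the features, mirroring the question's pipeline (SVC, MinMaxScaler, 10-fold shuffled KFold):

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # Synthetic stand-in for the Pima data: 8 features, binary target.
    X, y = make_classification(n_samples=500, n_features=8,
                               n_informative=4, random_state=42)
    # The question's mistake: the label rides along as a 9th "feature".
    X_leaky = np.column_stack([X, y])

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    scaler = MinMaxScaler(feature_range=(0, 1))

    clean_scores = cross_val_score(SVC(random_state=42),
                                   scaler.fit_transform(X), y,
                                   cv=kf, scoring="accuracy")
    leaky_scores = cross_val_score(SVC(random_state=42),
                                   scaler.fit_transform(X_leaky), y,
                                   cv=kf, scoring="accuracy")

    print("without label column:", clean_scores.mean())
    print("with label column:   ", leaky_scores.mean())  # near-perfect: the model just reads the answer
    ```

    In the question's code the fix is to exclude the label when building the feature matrix, e.g. `X = pima[col_names[:-1]].values`, keeping `y = pima.label.values` as the only place the target appears.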