Tags: python, machine-learning, scikit-learn, knn, gridsearchcv

Using GridSearchCV best_params_ gives poor results


I'm trying to tune hyperparameters for KNN on a quite small dataset (Kaggle Leaf, which has around 990 rows):

def knnTuning(self, x_train, t_train):
    
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    
    print(grid.best_params_)
    print(grid.best_score_)
    
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"], 
                   weight = grid.best_params_["weights"],
                   leafSize = grid.best_params_["leaf_size"])

Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999

And I return this classifier:

class KNN:

    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy

I run this several times using 700 rows for training and 200 for validation, chosen with a random permutation.

I then get global accuracy results ranging from 0.01 (often) to 0.4 (rarely).
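
For reference, one run looks roughly like this (a simplified sketch, assuming X and t hold the full features and labels as NumPy arrays):

import numpy as np

# One evaluation run: random permutation of the rows, 700 for training,
# 200 for validation, then the global accuracy of the tuned classifier
# on the held-out 200.
indices = np.random.permutation(len(X))
train_idx, val_idx = indices[:700], indices[700:900]

model = knnTuning(X[train_idx], t[train_idx])  # the tuning method above (self omitted)
model.train(X[train_idx], t[train_idx])
print(model.global_accuracy(X[val_idx], t[val_idx]))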

I know that I'm not comparing the same two metrics, but I still can't understand the huge difference between the results.


Solution

  • Not entirely sure how you trained your model or how the preprocessing was done. The Leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test sets to keep every class represented evenly. One reason for the weird accuracy could be that your samples are split unevenly, as the quick check below makes concrete.
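
    A quick illustration: with roughly 990 rows spread over about 99 species, each class has only around 10 samples, so an unstratified 200-row validation set can easily end up with zero or one example of many species.

    import pandas as pd

    df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")
    print(df['species'].nunique())              # ~99 distinct species
    print(df['species'].value_counts().head())  # roughly 10 rows per species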

    Also you would need to scale your features:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
    from sklearn.neighbors import KNeighborsClassifier

    df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")

    le = LabelEncoder()
    scaler = StandardScaler()
    X = df.drop(['id', 'species'], axis=1)
    X = scaler.fit_transform(X)          # X is now a NumPy array
    y = le.fit_transform(df['species'])

    # Stratified split keeps roughly the same class proportions in train and test
    strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
    x_train, y_train, x_test, y_test = [[X[train], y[train], X[test], y[test]]
                                        for train, test in strat][0]
    

    If we now do the training (I would be careful about including n_neighbors=1):

    params = {
        'n_neighbors': [2, 3, 4],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    
    sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
    grid.fit(x_train, y_train)
    
    print(grid.best_params_)
    print(grid.best_score_)
    
    {'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
    0.9676258992805755
    

    Then you can check the accuracy on your test set:

    pred = grid.predict(x_test)
    (y_test == pred).mean()
    
    0.9831649831649831
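
    If you want to keep your own wrapper class, you can also plug the tuned parameters back into the KNN class from the question and evaluate it on the same stratified split (a sketch; it reuses grid, x_train, y_train, x_test and y_test from above):

    # Reuse the tuned parameters with the KNN wrapper from the question,
    # trained and scored on the stratified split created earlier.
    best = grid.best_params_
    model = KNN(neighbors=best['n_neighbors'],
                weight=best['weights'],
                leafSize=best['leaf_size'])
    model.train(x_train, y_train)
    print(model.global_accuracy(x_test, y_test))  # should be close to the grid score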