I'm trying to tune hyperparameters for KNN on a fairly small dataset (Kaggle Leaf, which has around 990 rows):
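from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def knnTuning(self, x_train, t_train):
    # Candidate hyperparameter values for the grid search
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    # Build a new (not yet trained) wrapper from the best parameters
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])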
The best parameters found by the grid search are:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
I then return this classifier:
class KNN:
    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy
I ran this several times, using 700 rows for training and 200 for validation, chosen by random permutation.
The global accuracy I got ranged from 0.01 (often) to 0.4 (rarely).
I know I'm not comparing the same metric in both cases, but I still can't understand the huge difference between the results.
I'm not sure how you trained your model or how the preprocessing was done. The leaf dataset has about 100 labels (species), so you have to split train and test carefully to make sure each species is evenly represented in both. One reason for the odd accuracy could be that your samples are split unevenly.
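To illustrate how a plain random permutation can leave species unevenly represented, here is a small sketch using only the standard library. It assumes 99 classes with 10 samples each (an assumption that roughly matches the 990 rows and roughly 100 labels mentioned here) and the 700-row training set from the question. The labels are synthetic, not the real leaf data.

```python
import random
from collections import Counter

# Assumed layout: 99 classes x 10 samples = 990 rows (synthetic labels).
N_CLASSES, PER_CLASS, N_TRAIN = 99, 10, 700

labels = [c for c in range(N_CLASSES) for _ in range(PER_CLASS)]

rng = random.Random(0)            # fixed seed so the run is repeatable
indices = list(range(len(labels)))
rng.shuffle(indices)              # plain random permutation, no stratification

train_labels = [labels[i] for i in indices[:N_TRAIN]]
train_counts = Counter(train_labels)

# Per-class training counts; a class absent from training counts as 0.
counts = [train_counts.get(c, 0) for c in range(N_CLASSES)]
print("min per class:", min(counts), "max per class:", max(counts))
```

Since 700 is not a multiple of 99, the per-class counts cannot all be equal here, and an unstratified shuffle usually spreads them further apart. A stratified split keeps the class proportions close to those of the full dataset.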
You would also need to scale your features:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")
le = LabelEncoder()
scaler = StandardScaler()
X = df.drop(['id', 'species'], axis=1)
X = scaler.fit_transform(X)  # returns a NumPy array, so index it with X[...] rather than X.iloc
y = le.fit_transform(df['species'])
# One stratified 70/30 split, so the species proportions are preserved in both sets
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train, test = next(strat.split(X, y))
x_train, y_train, x_test, y_test = X[train], y[train], X[test], y[test]
Now we can run the training (I would be careful about including n_neighbors = 1):
params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}
# 10 stratified 80/20 splits of the training data for cross-validation
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
grid.best_params_ then gives:
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
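On the caution about n_neighbors = 1: one common reason for it (possibly not the only one intended here) is that a 1-nearest-neighbour model memorises the training set. Each training point is its own nearest neighbour, so training accuracy is perfect no matter how well the model generalises. A minimal sketch, assuming synthetic data from make_classification as a stand-in for the leaf features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the leaf dataset);
# flip_y adds label noise so memorisation and generalisation differ.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

one_nn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# Scoring on the rows the model was fitted on says nothing about new data.
train_acc = one_nn.score(X_tr, y_tr)
test_acc = one_nn.score(X_te, y_te)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

This is why a score computed on held-out rows is the number to trust when comparing values of n_neighbors.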
Then you can check on your test:
pred = grid.predict(x_test)
(y_test == pred).mean()
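A possible refinement, not part of the code in this answer: fitting StandardScaler on the full dataset before splitting lets statistics from the test rows influence training. Wrapping the scaler and the classifier in a Pipeline makes the grid search refit the scaler on the training part of each fold only. The sketch below uses synthetic data from make_classification as a stand-in for the leaf features (an assumption), and the step names scaler and knn are chosen here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the leaf dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scaling happens inside the pipeline, so it is fitted per training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
params = {
    "knn__n_neighbors": [2, 3, 4],
    "knn__weights": ["uniform", "distance"],
}
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
grid = GridSearchCV(pipe, params, cv=sss)
grid.fit(x_train, y_train)

# Held-out accuracy of the best pipeline, refitted on all training rows.
test_accuracy = (grid.predict(x_test) == y_test).mean()
print(grid.best_params_, test_accuracy)
```

Parameter names in the grid take the form step name, two underscores, parameter name, which is how GridSearchCV routes each value to the right pipeline step.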