I'm trying to implement from scratch a K-Nearest Neighbors classification algorithm in Python against this dataset but I've encountered some issues during the validation phase. In particular, this is my normal, straightforward implementation:
def KNNClassificationPredictOnValidation(row, norm_type, k):
distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, k)[:k]
values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
return np.argmax(np.bincount(values))
It can be run this way:
y_pred = []
for row in x_validation_np:
y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
Getting 58.600031530821376%
as accuracy which I am sure to be correct since I have the same identical result with SciKit with the following code:
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
However, my implementation is very slow. I wanted to speed up the validation phase without implementing more sophisticated data structures such as k-d or ball trees, and I had an idea.
During the validation phase, I check accuracy results in function of k which goes from left_end
to right_end
doing steps of 2; with my normal implementation, this implies that indexes
and distances
get recomputed every time, and they are heavy operations.
However, in addition to being heavy, this is a waste of resources as well! Basically, suppose I communicate to the function the value ofright_end
. In that case, I can calculate all the right_end
nearest neighbors and return a list of classification results considering for each left_end <= k < right_end
just the needed subgroup of neighbors from the ones already calculated:
# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, end)[:end+1]
return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]
And this is how I tested it:
# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))
results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
y = y_pred[:, i]
accuracy = metrics.accuracy_score(y_th_validation_np, y)
results.append((left_end+i*2, accuracy*100))
print(results)
But this was the output:
[(167, 58.3793157811761), (169, 58.473908245309794), (171, 58.48967365599874)]
So with k=169 I got an accuracy of 58.473908245309794%
which is different, and I cannot understand what I am doing wrong: the implementation is identical, I am just testing more cases at the same time.
Below I'll leave a minimum reproducible example:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[: , 2:]
# non-thresholded shares
y = df.pop('shares')
# thresholded shares
y_th = y.copy(deep=True)
y_th = y_th.apply(lambda x: 1 if x >= 1400 else 0)
# renaming the variable
x = df
# tresholded version
x_train, x_test, y_th_train, y_th_test = train_test_split(
x, y_th,
test_size=0.20,
random_state=1
)
x_train_np = x_train.to_numpy()
x_test_np = x_test.to_numpy()
y_th_train_np = y_th_train.to_numpy()
y_th_test_np = y_th_test.to_numpy()
# Creating validation set
x_validation_train, x_validation, y_th_validation_train, y_th_validation = train_test_split(
x_train, y_th_train,
test_size=0.20,
random_state=1
)
x_validation_train_np = x_validation_train.to_numpy()
x_validation_np = x_validation.to_numpy()
y_th_validation_train_np = y_th_validation_train.to_numpy()
y_th_validation_np = y_th_validation.to_numpy()
def KNNClassificationPredict(row, norm_type, k):
distances = np.linalg.norm(x_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, k)[:k]
values = [y_th_train_np[indexes[i]] for i in range(k)]
return np.argmax(np.bincount(values))
# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, end)[:end+1]
return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]
def KNNClassificationPredictOnValidation(row, norm_type, k):
distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, k)[:k]
values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
return np.argmax(np.bincount(values))
# Sklearn implementation against validation set
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
# My normal knn against validation set
y_pred = []
for row in x_validation_np:
y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))
results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
y = y_pred[:, i]
accuracy = metrics.accuracy_score(y_th_validation_np, y)
results.append((left_end+i*2, accuracy*100))
print(results)
Your issue is with how np.argpartition
works.
The documentation states that np.argpartition(array, kth)
will return the indices
of array
in such an order that the array[indices][kth]
element is in its final sorted position, all elements before it (array[indices][:kth]
) are smaller, and all elements after it (array[indices][kth+1:]
) are larger. But the order of the elements in array[indices][:kth]
and array[indices][kth+1:]
is not guaranteed, it might be completely random.
So what happens when you increase the value of kth
from 169
to 172
is that array[indices][169]
is no longer locked in place, it has randomly fallen somewhere into array[indices][:172]
.
Furthermore, some values that were previously guaranteed to be contained in array[indices][:169]
is now only guaranteed to be contained in array[indices][:172]
. Some of these values may fall into the range array[indices][169:172]
, and be replaced by values that were previously guaranteed to be contained in array[indices][169+1:]
.
The important takeaway from all this is that when you call
indexes = np.argpartition(distances, end)[:end+1]
with end=172
and then
[np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end+1, step)]
some of the indexes[i]
will be wrong when k < end
.
np.argpartition
accepts kth
to be an array. When kth
is an array, the indices
returned guarantees all the array[indices][kth]
to be in their final sorted position. We can do this for the entire problematic range left_end:right_end
without much performance impact:
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
indexes = np.argpartition(distances, range(start, end+1))[:end+1]
return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end+1, step)]
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end, 2))
results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
y = y_pred[:, i]
accuracy = metrics.accuracy_score(y_th_validation_np, y)
results.append((left_end+i*2, accuracy*100))
print(results)
Which outputs
[(167, 58.48967365599874), (169, 58.600031530821376), (171, 58.44237742393189)]