python, numpy, machine-learning, scikit-learn, knn

Inconsistent results on almost identical implementations of the K-NN algorithm


I'm trying to implement a K-Nearest Neighbors classification algorithm from scratch in Python against this dataset, but I've encountered some issues during the validation phase. In particular, this is my normal, straightforward implementation:

def KNNClassificationPredictOnValidation(row, norm_type, k):
    # Distance from `row` to every training point under the given norm
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    # Indices of the k nearest neighbors (unordered among themselves)
    indexes = np.argpartition(distances, k)[:k]
    # Majority vote over the neighbors' labels
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

It can be run this way:

y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))

print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

I get an accuracy of 58.600031530821376%, which I am sure is correct since I obtain the exact same result with scikit-learn using the following code:

neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

However, my implementation is very slow. I wanted to speed up the validation phase without implementing more sophisticated data structures such as k-d or ball trees, and I had an idea.
During the validation phase, I check the accuracy as a function of k, which goes from left_end to right_end in steps of 2; with my normal implementation, this means that distances and indexes get recomputed for every k, and those are the heavy operations.
Besides being heavy, this is also a waste of resources! Suppose I pass the value of right_end to the function. Then I can compute all the right_end nearest neighbors once and return a list of classification results, considering for each left_end <= k < right_end just the needed subgroup of the neighbors already calculated:

# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    # Compute the `end` nearest neighbors once...
    indexes = np.argpartition(distances, end)[:end+1]
    # ...then reuse the first k of them for each tested k
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]

And this is how I tested it:

# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))

results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
  y = y_pred[:, i]
  accuracy = metrics.accuracy_score(y_th_validation_np, y)
  results.append((left_end+i*2, accuracy*100))
print(results)

But this was the output:

[(167, 58.3793157811761), (169, 58.473908245309794), (171, 58.48967365599874)]

So with k=169 I get an accuracy of 58.473908245309794%, which is different, and I cannot understand what I am doing wrong: the implementation is identical, I am just testing more cases at the same time.

Below I'll leave a minimal reproducible example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
from sklearn.neighbors import KNeighborsClassifier



df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[:, 2:]

# non-thresholded shares
y = df.pop('shares')

# thresholded shares
y_th = y.copy(deep=True)
y_th = y_th.apply(lambda x: 1 if x >= 1400 else 0)

# renaming the variable
x = df

# thresholded version
x_train, x_test, y_th_train, y_th_test = train_test_split(
    x, y_th,
    test_size=0.20,
    random_state=1
)

x_train_np = x_train.to_numpy()
x_test_np = x_test.to_numpy()

y_th_train_np = y_th_train.to_numpy()
y_th_test_np = y_th_test.to_numpy()

# Creating validation set
x_validation_train, x_validation, y_th_validation_train, y_th_validation = train_test_split(
    x_train, y_th_train,
    test_size=0.20,
    random_state=1
)


x_validation_train_np = x_validation_train.to_numpy()
x_validation_np = x_validation.to_numpy()

y_th_validation_train_np = y_th_validation_train.to_numpy()
y_th_validation_np = y_th_validation.to_numpy()


def KNNClassificationPredict(row, norm_type, k):
    distances = np.linalg.norm(x_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, end)[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]

def KNNClassificationPredictOnValidation(row, norm_type, k):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Sklearn implementation against validation set
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My normal knn against validation set
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))

print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))

results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
  y = y_pred[:, i]
  accuracy = metrics.accuracy_score(y_th_validation_np, y)
  results.append((left_end+i*2, accuracy*100))
print(results)

Solution

  • The issue

    Your issue is with how np.argpartition works.

    The documentation states that np.argpartition(array, kth) returns indices of array such that the element array[indices][kth] is in its final sorted position, all elements before it (array[indices][:kth]) are smaller than or equal to it, and all elements after it (array[indices][kth+1:]) are greater than or equal to it. But the order of the elements within array[indices][:kth] and array[indices][kth+1:] is not guaranteed; it can be completely arbitrary.

    So what happens when you increase the value of kth from 169 to 172 is that array[indices][169] is no longer locked in place; it may have fallen anywhere into array[indices][:172].

    Furthermore, some values that were previously guaranteed to be contained in array[indices][:169] are now only guaranteed to be contained in array[indices][:172]. Some of these values may fall into the range array[indices][169:172] and be replaced by values that were previously guaranteed to be contained in array[indices][169+1:].

    The important takeaway from all this is that when you call

    indexes = np.argpartition(distances, end)[:end+1]
    

    with end=172 and then

    [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]
    

    some of the indexes[i] will be wrong when k < end.
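
    To see this concretely, here is a minimal sketch with made-up toy data (not the question's dataset), just to illustrate the property:

    import numpy as np

    rng = np.random.default_rng(0)
    distances = rng.random(20)

    # Guaranteed: the first 5 indices point at the 5 smallest distances,
    # in no particular order among themselves
    five_smallest = np.argpartition(distances, 5)[:5]

    # NOT guaranteed: slicing a larger partition gives some 5 of the
    # 10 smallest distances, not necessarily the 5 smallest
    ten_then_five = np.argpartition(distances, 10)[:5]

    print(set(five_smallest))
    print(set(ten_then_five))  # may be a different set entirely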

  • The solution

    np.argpartition also accepts an array for kth. When kth is an array, the returned indices guarantee that every element array[indices][kth] is in its final sorted position. We can do this for the entire problematic range left_end:right_end without much performance impact:

    def KNNClassificationValidationPredict(row, norm_type, start, end, step):
        distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
        # Pivot every position in [start, end] into its final sorted place,
        # so each prefix [:k] contains exactly the k nearest neighbors
        indexes = np.argpartition(distances, range(start, end+1))[:end+1]
        return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end+1, step)]
    
    left_end = 167
    right_end = 171
    y_pred = []
    for row in x_validation_np:
        y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end, 2))
    
    results = []
    y_pred = np.array([np.array(y) for y in y_pred])
    for i in range(len(y_pred[0])):
        y = y_pred[:, i]
        accuracy = metrics.accuracy_score(y_th_validation_np, y)
        results.append((left_end+i*2, accuracy*100))
    print(results)
    

    Which outputs

    [(167, 58.48967365599874), (169, 58.600031530821376), (171, 58.44237742393189)]
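
    As a sanity check, here is a minimal standalone sketch (again with toy data, not the question's dataset) verifying the array-kth guarantee: every prefix slice of the partition matches a full np.argsort.

    import numpy as np

    rng = np.random.default_rng(0)
    distances = rng.random(500)

    start, end = 167, 171
    # Pivot every position in [start, end] into its final sorted place
    indexes = np.argpartition(distances, range(start, end + 1))

    true_order = np.argsort(distances)
    for k in range(start, end + 2):
        # Each prefix [:k] now holds exactly the k smallest distances
        assert set(indexes[:k]) == set(true_order[:k])
    print("all prefix slices match np.argsort")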