Tags: python, scikit-learn, distance, knn

Heavily weighted distance returns the same results as regular distance in kNN with the iris dataset


I am experimenting with how weighting the distance metric affects the performance of the kNN algorithm, and for a reproducible example I am working with the iris dataset.

To my surprise, weighting 2 of the predictors 100 times more than the other 2 generates predictions identical to those of the unweighted model. What explains this rather counterintuitive finding?

My code is the following:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

X_original = iris['data']
Y = iris['target']

sc = StandardScaler()  # Create the scaler

X = sc.fit_transform(X_original)  # Fit the scaler and return the standardized data

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)

train_index, test_index = next(sss.split(X, Y))  # The generator yields one (train, test) index pair

X_train, X_test = X[train_index], X[test_index]

Y_train, Y_test = Y[train_index], Y[test_index]

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 6)

iris_fit = knn.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas DataFrames/Series;
                                      # all values must be numeric and there should be no NaNs

predictions_w1 = knn.predict(X_test)

weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)  # Rescaling all weights by the same constant scales every distance
                                     # equally, so it cannot change the neighbor ranking

# Note: 'wminkowski' was removed in later scikit-learn releases; there you
# would use metric='minkowski' with metric_params={'w': weights} instead
knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
                             metric_params={'w': weights})

iris_fit_w = knn_w.fit(X_train, Y_train)

predictions_w100 = knn_w.predict(X_test)

(predictions_w1 != predictions_w100).sum()
# Output: 0

Solution

  • They are not always the same; add a random state to your train/test split and you will see how the predictions change for different values:

     StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
    

    Additionally, the weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features essentially gives you the same results as running kNN on those 2 features alone with an unweighted Minkowski distance. And since those two features are quite informative on their own, it is no surprise that the results are very similar to those obtained with all 4 features. See the scatter plot from Wikipedia below, and the sketch after it.

    [Figure: iris dataset scatter plot, from Wikipedia]
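    To make both points concrete, here is a minimal sketch that repeats the experiment for several random states and also compares the heavily weighted model against a kNN fit on the petal features alone. It assumes a recent scikit-learn (>= 1.1), where the plain 'minkowski' metric accepts a weight vector w; the 'wminkowski' metric used in the question was removed in later releases. (Note that 'minkowski' applies w to the |difference|^p terms, whereas the old 'wminkowski' multiplied the differences by w before raising to p, so the weighting here is somewhat milder; either way the petal features dominate.)

     import numpy as np
     from sklearn.datasets import load_iris
     from sklearn.preprocessing import StandardScaler
     from sklearn.model_selection import StratifiedShuffleSplit
     from sklearn.neighbors import KNeighborsClassifier

     iris = load_iris()
     X = StandardScaler().fit_transform(iris['data'])
     Y = iris['target']
     weights = np.array([1, 1, 100, 100]) / 202  # Same weights as in the question

     for seed in range(5):
         sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8,
                                      test_size=0.2, random_state=seed)
         train_idx, test_idx = next(sss.split(X, Y))
         X_train, X_test = X[train_idx], X[test_idx]
         Y_train, Y_test = Y[train_idx], Y[test_idx]

         # Unweighted kNN on all 4 features
         plain = KNeighborsClassifier(n_neighbors=6).fit(X_train, Y_train)
         # Heavily weighted kNN: petal length/width dominate the distance
         heavy = KNeighborsClassifier(n_neighbors=6, metric='minkowski', p=2,
                                      metric_params={'w': weights}).fit(X_train, Y_train)
         # Unweighted kNN on the petal features only (columns 2 and 3)
         petal = KNeighborsClassifier(n_neighbors=6).fit(X_train[:, 2:], Y_train)

         p_plain = plain.predict(X_test)
         p_heavy = heavy.predict(X_test)
         p_petal = petal.predict(X_test[:, 2:])

         print(f"seed={seed}: plain vs weighted differ on "
               f"{(p_plain != p_heavy).sum()} test points, "
               f"weighted vs petal-only differ on {(p_heavy != p_petal).sum()}")

    If the reasoning above holds, the weighted vs petal-only counts should stay at or near zero across seeds, while the plain vs weighted counts vary with the split.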