Tags: python, scikit-learn, knn, nearest-neighbor

KNeighborsRegressor .predict() function giving suspiciously perfect results when trained with weights='distance'?


If I train a KNeighborsRegressor (via scikit-learn) and then want to compare its predictions against the target variable, I can do that this way:

import pandas as pd
from sklearn import neighbors

# Initialize the model
knn = neighbors.KNeighborsRegressor(n_neighbors=8)

# Define independent and target variables
# (df is the dataframe whose first rows are shown further down)
X = df[['var1', 'var2', 'var3']]
Y = df['target']

# Fit the model and store the predictions
knn.fit(X, Y)
predicted = knn.predict(X).ravel()

If I compare them, I can see this model is far from perfect, which is expected:

compare = pd.DataFrame(predicted, Y).reset_index()
compare.columns = ['Y', 'predicted']
compare.head(3)

Returns:

+------+-----------+
| Y    | predicted |
+------+-----------+
| 985  | 2596      |
+------+-----------+
| 801  | 2464      |
+------+-----------+
| 1349 | 1907      |
+------+-----------+

If I do the exact same thing, except weighting neighbors by distance, the predict() function returns the target variable EXACTLY.

# Initialize the model, this time weighting neighbors by distance
knn_dist = neighbors.KNeighborsRegressor(n_neighbors=8, weights='distance')

# Fit the model and store the predictions
knn_dist.fit(X, Y)
predicted2 = knn_dist.predict(X).ravel()

compare = pd.DataFrame(predicted2, Y).reset_index()
compare.columns = ['Y', 'predicted2']
compare.head(3)

Returns identical columns:

+------+------------+
| Y    | predicted2 |
+------+------------+
| 985  | 985        |
+------+------------+
| 801  | 801        |
+------+------------+
| 1349 | 1349       |
+------+------------+

I know the predictor isn't really as perfect as this implies, and I can prove that with cross-validation:

from sklearn.model_selection import ShuffleSplit, cross_val_score

score_knn = cross_val_score(knn, X, Y, cv=ShuffleSplit(test_size=0.1))
print(score_knn.mean())
# 0.5306705590672681

What am I doing wrong?


Per request, here are the first five rows of the relevant columns in my dataframe:

| ID | var1     | var2     | var3     | target |
|----|----------|----------|----------|--------|
| 1  | 0.363625 | 0.805833 | 0.160446 | 985    |
| 2  | 0.353739 | 0.696087 | 0.248539 | 801    |
| 3  | 0.189405 | 0.437273 | 0.248309 | 1349   |
| 4  | 0.212122 | 0.590435 | 0.160296 | 1562   |
| 5  | 0.22927  | 0.436957 | 0.1869   | 1600   |

Solution

First of all, you train the model on the whole dataset and then you predict using that same dataset:

knn_dist.fit(X, Y)
predicted2 = knn_dist.predict(X).ravel()

The perfect performance here is a textbook case of overfitting. With weights='distance', each training point's nearest neighbor is itself, at distance zero; scikit-learn resolves the resulting division by zero by giving zero-distance neighbors all the weight, so for every point in X the weighting of that point is essentially 1 and the prediction simply returns its own target.
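
You can verify this by asking the fitted model for each training row's neighbors. A minimal sketch, reusing the X and knn_dist from your question:

# Query the fitted model for each training row's nearest neighbors
dist, idx = knn_dist.kneighbors(X)
print(dist[0])  # first distance is 0.0 -- the row's nearest neighbor is itself
print(idx[0])   # the first index points back at the query row's own position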


Next, when you use cross-validation you see that the model is not so perfect. You should always evaluate on data the model was not fitted on, especially when you are trying to predict (regression) a target variable; as sketched below, even a simple train/test split exposes the gap.
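
A minimal sketch of a held-out evaluation (the split fraction and random_state are arbitrary choices, not from your question):

from sklearn.model_selection import train_test_split

# Hold out 10% of the rows; the model never sees them during fitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

knn_dist.fit(X_train, Y_train)
print(knn_dist.score(X_test, Y_test))  # R^2 on unseen rows -- no longer perfect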

Also, for regression problems do NOT use cross_val_score without specifying the scoring argument: by default it falls back to the estimator's score method, which for KNeighborsRegressor is R² and may not be the metric you actually care about.
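
For instance, you could request an explicit error metric (neg_mean_absolute_error is just one reasonable choice):

# Pick the regression metric explicitly rather than relying on the default
score_knn = cross_val_score(knn, X, Y, cv=ShuffleSplit(test_size=0.1),
                            scoring='neg_mean_absolute_error')
print(score_knn.mean())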

You can alternatively use cross_val_predict to get out-of-fold predictions; see the scikit-learn documentation for details.
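
With it, each row is predicted by a model that never saw that row during fitting, so the comparison against Y is honest. A sketch (cv=5 is an arbitrary choice):

from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each row's fold is held out when fitting
predicted_cv = cross_val_predict(knn_dist, X, Y, cv=5)
compare = pd.DataFrame({'Y': Y.values, 'predicted_cv': predicted_cv})
print(compare.head(3))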

If you add some information (like the dimensions of X) I could help more.