I ran into this problem while implementing KNN imputation for missing data from scratch. I created a dummy dataset and looked for the nearest neighbors of the rows that contain missing values. Here is my dataset:
     A    B     C     D      E
0  NaN  2.0   4.0  10.0  100.0
1  NaN  3.0   9.0  12.0    NaN
2  5.0  2.0  20.0  50.0   75.0
3  3.0  5.0   7.0   NaN  150.0
4  2.0  9.0   7.0  30.0   90.0
For row 0, the nearest neighbors are rows 1 and 2. To replace the NaN value at (0, A), we compute the distance-weighted average of the nearest neighbors' values in the same column. But what if one of the nearest neighbors' values is also NaN?
Example:
Suppose the nearest neighbors of row 3 are rows 2 and 4. Row 3 has a missing value in column D, and to replace it we compute the distance-weighted average of the nearest neighbors' values in column D, like this:
distance average = [(1/D1) * 50.0 + (1/D2) * 30.0]/2
and replace the NaN value at (3, D) with this average (where D1 and D2 are the corresponding nan-Euclidean distances). But for row 0, the nearest neighbors are rows 1 and 2, and to replace the NaN value at (0, A) we need the distance-weighted average of the values of rows 1 and 2 in column A. The value at (2, A) is 5.0, which is fine, but the value at (1, A) is NaN, so we can't compute it like this:
distance average = [(1/D3) * NaN + (1/D4) * 5.0]/2
So how do we replace the NaN value at (0, A)? And how does sklearn's KNNImputer handle this kind of scenario?
The sklearn KNNImputer uses the nan_euclidean_distances metric by default. According to its user guide:
If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed.
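In other words, for each missing feature the neighbors are drawn only from the rows that actually have a value for that feature. Row 1 therefore cannot be a neighbor for (0, A), because its own A is missing, and the algorithm might use different neighborhoods to impute the single missing value in column D and the two missing values in column A.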
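To make the neighbor selection concrete, here is a minimal from-scratch sketch in plain Python. It is not sklearn's code: the helper names nan_euclidean and impute_cell are made up for illustration, it assumes uniform weights (a plain mean of the donors), and it ignores details such as tie-breaking. The distance follows the documented nan-Euclidean definition: the squared differences over coordinates present in both rows, rescaled by (total coordinates / present coordinates).

```python
import math

NAN = float("nan")

# The dummy dataset from the question (columns A..E, rows 0..4).
X = [
    [NAN, 2.0, 4.0, 10.0, 100.0],
    [NAN, 3.0, 9.0, 12.0, NAN],
    [5.0, 2.0, 20.0, 50.0, 75.0],
    [3.0, 5.0, 7.0, NAN, 150.0],
    [2.0, 9.0, 7.0, 30.0, 90.0],
]


def nan_euclidean(u, v):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / present coordinates)."""
    pairs = [(a, b) for a, b in zip(u, v)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return NAN  # no shared coordinates, distance is undefined
    scale = len(u) / len(pairs)
    return math.sqrt(scale * sum((a - b) ** 2 for a, b in pairs))


def impute_cell(X, row, col, n_neighbors=2):
    """Hypothetical helper: return (donor rows, uniform mean) for X[row][col].

    Only rows that have a value in `col` are eligible donors, so a
    neighbor can never contribute a NaN to the average.
    """
    donors = [i for i in range(len(X))
              if i != row and not math.isnan(X[i][col])]
    # Drop donors whose distance to `row` is undefined.
    donors = [i for i in donors
              if not math.isnan(nan_euclidean(X[row], X[i]))]
    donors.sort(key=lambda i: nan_euclidean(X[row], X[i]))
    nearest = donors[:n_neighbors]
    if not nearest:
        return [], NAN
    return nearest, sum(X[i][col] for i in nearest) / len(nearest)


print(impute_cell(X, 0, 0))  # (0, A): donors are rows 4 and 2, mean 3.5
print(impute_cell(X, 3, 3))  # (3, D): donors are rows 1 and 0, mean 11.0
```

For (0, A), row 1 is skipped as a donor and the two closest remaining rows are 4 and 2, giving (2.0 + 5.0) / 2 = 3.5, which agrees with the KNNImputer output shown in this answer.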
Here is a simple example that runs KNNImputer on your dataset:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
A = [np.nan, np.nan, 5, 3, 2]
B = [2, 3, 2, 5, 9]
C = [4, 9, 20, 7, 7]
D = [10, 12, 50, np.nan, 30]
E = [100, np.nan, 75, 150, 90]
columns = ['A', 'B', 'C', 'D', 'E']
data = pd.DataFrame(list(zip(A, B, C, D, E)), columns=columns)
imputer = KNNImputer(n_neighbors=2)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=columns)
This is the output:
     A    B     C     D      E
0  3.5  2.0   4.0  10.0  100.0
1  2.5  3.0   9.0  12.0  125.0
2  5.0  2.0  20.0  50.0   75.0
3  3.0  5.0   7.0  11.0  150.0
4  2.0  9.0   7.0  30.0   90.0
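The values above are plain means of the two donors because KNNImputer defaults to weights="uniform". If you want something closer to the inverse-distance average described in the question, the imputer also accepts weights="distance". A minimal sketch, assuming the same dataset and n_neighbors=2 (the variable names weighted_imputer and weighted are only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

columns = ["A", "B", "C", "D", "E"]
data = pd.DataFrame(
    [
        [np.nan, 2, 4, 10, 100],
        [np.nan, 3, 9, 12, np.nan],
        [5, 2, 20, 50, 75],
        [3, 5, 7, np.nan, 150],
        [2, 9, 7, 30, 90],
    ],
    columns=columns,
)

# weights="distance" gives closer donors more influence than farther ones;
# the donors themselves are still chosen only among rows that have a value
# for the feature being imputed.
weighted_imputer = KNNImputer(n_neighbors=2, weights="distance")
weighted = pd.DataFrame(weighted_imputer.fit_transform(data), columns=columns)
print(weighted)
```

With this setting the imputed value at (0, A) should move from 3.5 toward 2.0, since row 4 is closer to row 0 than row 2 is.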