scikit-learn, knn, imputation

Impute among specific values only


I have a dataframe in which I need to impute a value based on the other samples. The column is numerical but holds industry codes, e.g. 1111 for IT, 1234 for Finance, and so on. I tried KNNImputer and it does produce a number, but as far as I understand it averages the values of the nearest neighbors, so it generates a number that does not exist in the column.

The imputer code is as follows:

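import pandas as pd
from sklearn.impute import KNNImputer

X = df.copy()
# Each missing value is filled with the mean of its 5 nearest neighbors
imputer = KNNImputer(n_neighbors=5)
filled = imputer.fit_transform(X)

cols = X.columns

# fit_transform returns a NumPy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(data=filled, columns=cols)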

The output it provides is 6405.2. However, the closest existing industry codes are 6399 and 6411.

How can I impute a numerical column so that only values already present in it are used?


Solution

  • The technical answer is surprisingly simple: ask for a single neighbor in your KNN imputer:

    imputer = KNNImputer(n_neighbors=1)
    

    This way the imputed value is not averaged over several neighbors; it is copied from the single nearest sample, so it can only be a value that already exists in your data.

    Notice that this answers the programming question you are actually asking; whether it is the correct approach for the specific form of your data and features is beyond the scope of this answer (and arguably off-topic for SO).
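To see the difference between averaging and taking a single neighbor, here is a minimal sketch on a made-up dataframe. The `revenue` column, the sample values, and the `impute` helper are all assumptions for illustration and are not taken from the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: "industry" holds category-like codes stored as numbers.
df = pd.DataFrame({
    "revenue": [1.0, 1.2, 5.0, 5.2, 1.05],
    "industry": [1111.0, 1111.0, 1234.0, 1234.0, np.nan],
})

def impute(frame, k):
    """Fill missing values with KNNImputer using k neighbors (illustrative helper)."""
    imputer = KNNImputer(n_neighbors=k)
    return pd.DataFrame(imputer.fit_transform(frame), columns=frame.columns)

averaged = impute(df, 3)  # mean of the 3 nearest codes, may not be a real code
single = impute(df, 1)    # copies the code of the single nearest sample

existing_codes = set(df["industry"].dropna())
print(averaged.loc[4, "industry"], averaged.loc[4, "industry"] in existing_codes)
print(single.loc[4, "industry"], single.loc[4, "industry"] in existing_codes)
```

With three neighbors the filled value is an average of two different codes and therefore matches neither; with one neighbor it is always one of the codes already present. The neighbor is chosen by distance on the other columns, so the result depends on how those columns are scaled.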