I have a dataset of 165 instances and 49 features, with a binary target (1 and 0). This dataset has missing values, so I am trying KNNImputer with five-fold cross-validation. Here is the code:
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
df = read_csv('data.csv', header=None, na_values='?')
data = df.values
ix = [i for i in range(data.shape[1]) if i != 49]
X, y = data[:, ix], data[:, 49]
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
But the problem here is that I don't need a score. I want the dataset (in five folds or as a whole) after filling in the missing values in the folds, because I need to do feature selection using the five folds after the imputation, and then classification. So how can I get the dataset after imputation?
As discussed in the comments, the CV procedure will not be of any actual help here. What you actually need is to:

- fit a KNNImputer with your training data only
- use it to transform (impute) your training data
- use the same fitted imputer to transform (impute) your test data

This way, both your training and test data will share a common impute procedure, hence whatever feature selection method you choose will be actually applicable to both datasets.
Here is a demonstration with dummy data, adapting the example from the documentation:
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]] # dummy data
imputer = KNNImputer(n_neighbors=2)
X_imp = imputer.fit_transform(X)  # fit imputer & transform training data in 1 step
X_imp
# result:
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
# new (unseen - test) data with missing values:
# we DON'T fit the imputer again
X_new = np.array([[7, 3, 4], [np.nan, 8, 7]])
X_new_imp = imputer.transform(X_new)  # use the imputer already fitted on the training data
X_new_imp
# result:
array([[7. , 3. , 4. ],
       [5.5, 8. , 7. ]])
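To connect this back to your 5-fold setting: within each fold, fit the imputer on the training part only and transform both parts, collecting the imputed arrays so you can run feature selection and classification on them afterwards. A minimal sketch (using random dummy data with the same shape as your dataset, since I don't have your data.csv):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold

# dummy data standing in for your 165 x 49 dataset with missing values
rng = np.random.default_rng(1)
X = rng.random((165, 49))
X[rng.random(X.shape) < 0.1] = np.nan  # sprinkle ~10% missing values
y = rng.integers(0, 2, size=165)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
imputed_folds = []
for train_idx, test_idx in skf.split(X, y):
    imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
    X_train_imp = imputer.fit_transform(X[train_idx])  # fit on the training fold only
    X_test_imp = imputer.transform(X[test_idx])        # reuse the fitted imputer
    imputed_folds.append((X_train_imp, y[train_idx], X_test_imp, y[test_idx]))
```

Each entry of `imputed_folds` now holds a fully imputed train/test split, ready for your feature-selection step, without any information leaking from a test fold into the imputation.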