python missing-data imputation fancyimpute

Imputation on the test set with fancyimpute

The Python package fancyimpute provides several methods for imputing missing values. The documentation gives examples such as:

from fancyimpute import IterativeImputer

# X is the complete data matrix
# X_incomplete has the same values as X except a subset has been replaced with NaN

# Model each feature with missing values as a function of other features, and
# use that estimate for imputation.
X_filled_ii = IterativeImputer().fit_transform(X_incomplete)

This works fine when applying the imputation method to a dataset X. But what if a training/test split is necessary? Once

X_train_filled = IterativeImputer().fit_transform(X_train_incomplete)

is called, how do I impute the test set and create X_test_filled? The test set needs to be imputed using only the information from the training set. I guess that IterativeImputer() should return an object that can then be applied to X_test_incomplete. Is that possible?

Please note that imputing the whole dataset and then splitting it into training and test sets is not correct, because the imputed training values would then depend on the test data.

Solution

The package looks like it mimics scikit-learn's API, and after looking at the source code, it looks like the imputer does have a transform method.

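from fancyimpute import IterativeImputer

# fit the imputer on the training set only, and impute it
my_imputer = IterativeImputer()
X_train_filled = my_imputer.fit_transform(X_train_incomplete)

# now transform the test set with the already-fitted imputer
X_test_filled = my_imputer.transform(X_test_incomplete)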

The imputer fills in the test set using what it learned from the training set; calling transform does not refit it on the test data.
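
For a fuller picture, here is a minimal end-to-end sketch of the same fit-on-train, transform-on-test pattern. It assumes scikit-learn's own IterativeImputer as a stand-in with the same fit/transform interface, and the data, the 10% missing rate and the variable names are made up for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): the third feature depends on the other two,
# so there is something for the imputer to learn.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Replace roughly 10% of the entries with NaN.
X_incomplete = X.copy()
X_incomplete[rng.random(X.shape) < 0.1] = np.nan

# Split BEFORE imputing, so the test rows never influence the fit.
X_train_incomplete, X_test_incomplete = train_test_split(
    X_incomplete, test_size=0.25, random_state=0
)

imputer = IterativeImputer(random_state=0)
X_train_filled = imputer.fit_transform(X_train_incomplete)  # fit on training rows only
X_test_filled = imputer.transform(X_test_incomplete)        # reuse the fitted imputer

# Entries that were observed in the test set are left as they were.
observed = ~np.isnan(X_test_incomplete)
```

Only the NaN entries change; the observed values in X_test_incomplete come through transform untouched, which is an easy sanity check to run on your own data.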