python missing-data imputation fancyimpute

Imputation on the test set with fancyimpute


The Python package fancyimpute provides several methods for imputing missing values. Its documentation gives examples such as:

# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN

# Model each feature with missing values as a function of other features, and
# use that estimate for imputation.
X_filled_ii = IterativeImputer().fit_transform(X_incomplete)

This works fine when applying the imputation method to a dataset X. But what if a training/test split is necessary? Once

X_train_filled = IterativeImputer().fit_transform(X_train_incomplete)

is called, how do I impute the test set and create X_test_filled? The test set needs to be imputed using only the information learned from the training set. I guess that IterativeImputer() should return an object that can then be applied to X_test_incomplete. Is that possible?

Please note that imputing the whole dataset and then splitting it into training and test sets is not correct, since the imputed training values would then depend on the test data.


Solution

  • The package looks like it mimics scikit-learn's API, and after looking at the source code, it does appear to have a transform method.

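    from fancyimpute import IterativeImputer

    # fit on the training set only
    my_imputer = IterativeImputer()
    X_train_filled = my_imputer.fit_transform(X_train_incomplete)

    # now transform the test set with the already-fitted imputer
    X_test_filled = my_imputer.transform(X_test_incomplete)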
    

    The imputer applies to the test set the imputation model it learned from the training set; it is not refit on the test data.
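
For anyone who wants to try the fit-on-train, transform-on-test pattern end to end, here is a minimal sketch. It makes a few assumptions that are not in the question: it imports IterativeImputer from scikit-learn (which, as far as I know, is the class recent fancyimpute releases re-export), the data is synthetic, and the split uses scikit-learn's train_test_split. The names rng, X and imputer are only illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, correlated data (illustrative only):
# the third column is a noisy sum of the first two.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Replace roughly 10% of the entries with NaN.
X_incomplete = X.copy()
X_incomplete[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the test rows never influence the imputer.
X_train_incomplete, X_test_incomplete = train_test_split(
    X_incomplete, test_size=0.25, random_state=0
)

imputer = IterativeImputer(random_state=0)
X_train_filled = imputer.fit_transform(X_train_incomplete)  # fit on training rows only
X_test_filled = imputer.transform(X_test_incomplete)        # reuse the fitted imputer
```

Keeping a reference to the fitted imputer is the whole point: calling IterativeImputer().fit_transform(...) a second time on the test set would fit a fresh model on the test data instead of reusing what was learned from the training set.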