python, scikit-learn, missing-data, imputation

Using Sklearn's GridSearchCV for finding best imputation method without estimator


I'd like to find the best imputation method for missing data in Scikit-learn.

I have a dataset X and an artificially corrupted version of it, X_na, so I can measure the quality of different imputations. I'm wondering whether I could use sklearn's GridSearchCV to search over the candidate imputers like this:

imputer_pipe = Pipeline([("imputer", SimpleImputer())])

params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

# X_na is the corrupted input, X the complete ground truth
imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring="neg_mean_squared_error", cv=5)
imputer_grid.fit(X_na, X)

The problem is that imputer_grid.fit doesn't channel X_na and X through the imputer pipeline in a way that lets the scoring (MSE) compare the imputed X_na with X. The pipeline must end with some object whose .fit() accepts both X and y.


Solution

  • Your imputers are transformers and have no predict method, which the scorer needs. You can add a custom final estimator whose predict simply returns its input, i.e. the imputed matrix passed to it. Below is something I adapted from DummyRegressor:

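    from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
    from sklearn.utils import check_array, check_consistent_length

    class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
        """Pass-through final step: predict() returns the (imputed) input."""

        def fit(self, X, y):
            # Nothing to learn; only validate y, as DummyRegressor does.
            y = check_array(y, ensure_2d=False)
            if len(y) == 0:
                raise ValueError("y must not be empty.")

            check_consistent_length(X, y)

            return self

        def predict(self, X):
            # X has already been imputed by the preceding pipeline step.
            return X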
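
To see why the pass-through step makes scoring possible, here is a minimal sketch (not part of the original answer). It uses a stripped-down IdentityFunction without the input checks, and the small random dataset and seed are assumptions chosen only for illustration. It checks that calling predict on the pipeline returns exactly what the imputer alone would produce:

```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """Simplified pass-through estimator (input validation omitted)."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X


# Illustrative data: complete matrix X and a corrupted copy X_na.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (20, 3))
X_na = np.where(X < 0.3, np.nan, X)

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("identity", IdentityFunction())])
pipe.fit(X_na, X)

# Pipeline.predict transforms X_na with the imputer, then calls
# IdentityFunction.predict, which hands the imputed matrix back.
imputed_via_pipeline = pipe.predict(X_na)
imputed_directly = SimpleImputer().fit_transform(X_na)
same = np.allclose(imputed_via_pipeline, imputed_directly)
print(same)
```

Because the pipeline's "prediction" is the imputed matrix, any regression scorer that compares predictions with y ends up comparing the imputed X_na with the complete X.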
    

    Then we define the pipeline and the parameter grid:

    from sklearn.pipeline import Pipeline
    from sklearn.experimental import enable_iterative_imputer  # noqa
    from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
    from sklearn.model_selection import GridSearchCV
    import numpy as np
    
    imputer_pipe = Pipeline([("imputer" , SimpleImputer()),
                            ("identity", IdentityFunction())])
    
    params = [{"imputer":[SimpleImputer()]},
              {"imputer":[IterativeImputer()]},
              {"imputer":[KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]
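
As a quick sanity check on the grid above, the following sketch (an addition, not from the original answer) expands the list of dicts with ParameterGrid to show which candidates GridSearchCV would evaluate. Each dict is expanded separately, so the n_neighbors values only apply to the KNNImputer entry:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import ParameterGrid

params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

# One candidate each for SimpleImputer and IterativeImputer,
# plus one KNNImputer candidate per n_neighbors value.
candidates = list(ParameterGrid(params))
for candidate in candidates:
    print(candidate)

n_candidates = len(candidates)
```

Replacing the whole "imputer" step through the grid is what lets a single pipeline search over different imputer classes.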
    

    Use a dummy dataset and fit:

    X = np.random.uniform(0, 1, (100, 3))
    X_na = np.where(X < 0.3, np.nan, X)  # values below 0.3 become missing
    
    imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                                scoring="neg_mean_squared_error", cv=5)
    imputer_grid.fit(X_na, X)
    

    The result (the selected pipeline) is not meaningful here, because the dummy matrix is independent random noise with no structure the imputers could exploit:

    Pipeline(steps=[('imputer', IterativeImputer()),
                    ('identity', IdentityFunction())])
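
One possible refinement, sketched here as an assumption rather than as part of the original answer: "neg_mean_squared_error" averages over every entry, including observed ones that the imputers return unchanged. A custom scoring callable with the (estimator, X, y) signature that GridSearchCV accepts can restrict the error to the entries that were actually missing. The name masked_neg_mse, the simplified IdentityFunction, and the seeded dummy data are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """Simplified pass-through estimator (input validation omitted)."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X


def masked_neg_mse(estimator, X_corrupted, X_true):
    """Negative MSE over the entries that are NaN in X_corrupted only."""
    X_imputed = estimator.predict(X_corrupted)
    mask = np.isnan(X_corrupted)
    return -np.mean((X_imputed[mask] - np.asarray(X_true)[mask]) ** 2)


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 3))
X_na = np.where(X < 0.3, np.nan, X)

imputer_pipe = Pipeline([("imputer", SimpleImputer()),
                         ("identity", IdentityFunction())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring=masked_neg_mse, cv=5)
imputer_grid.fit(X_na, X)

# Compare all candidates instead of looking only at the winner.
results = imputer_grid.cv_results_
for candidate, score in zip(results["params"], results["mean_test_score"]):
    print(candidate, round(score, 4))
```

Scores are negated so that higher is better, matching scikit-learn's convention. On this random data the differences between candidates should not be read as meaningful.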