
Custom wrapper class for scikit-learn's IterativeImputer for use with cross_val_score()


Scikit-learn's IterativeImputer imputes missing values in a round-robin fashion. To compare its performance against conventional regressors, I would like to build a simple pipeline and get scoring metrics from cross_val_score. The problem is that IterativeImputer has no 'predict' method, so the call fails with this error:

AttributeError: 'IterativeImputer' object has no attribute 'predict'

Here is a minimal example of what I am trying to achieve:

# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# define scaler, model and pipeline
scaler = StandardScaler() # use any scaler
imputer = IterativeImputer() # with any estimator, default = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])

train, test = df.values, df['A'].values 
scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
print(scores)

What possible solutions exist? If a custom wrapper is needed, how should it be written to include the 'predict' method?


Solution

  • cross_val_score needs a pipeline that ends with a model, i.e. a final step that has a predict method:

    scaler  = StandardScaler()
    imputer = IterativeImputer()
    model   = BayesianRidge()  # any model
    
    pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
    

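    Running cross_val_score without a model makes no sense: with scoring='r2' it has to compare predictions against y, and an imputer only transforms X.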
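
To see why the final step matters, a quick check can help. This is only a sketch with illustrative variable names: a Pipeline exposes predict only when its last step has one, so hasattr can be used to compare the pipeline from the question with the one above.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# pipeline from the question: the last step is a transformer
without_model = Pipeline(steps=[('s', StandardScaler()), ('i', IterativeImputer())])

# pipeline from this answer: the last step is a regressor
with_model = Pipeline(steps=[('s', StandardScaler()),
                             ('i', IterativeImputer()),
                             ('m', BayesianRidge())])

# Pipeline delegates predict to its final step, if that step has it
has_predict_without = hasattr(without_model, 'predict')
has_predict_with = hasattr(with_model, 'predict')
print(has_predict_without, has_predict_with)
```

Only the second pipeline can be scored with 'r2', because the scorer has to call predict.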


    I also see another problem: the values train, test that you pass to cross_val_score.

    They should be named X, y rather than train, test. The names themselves are not so important, but what you assign to the variables is.

    The problem is that X should not contain y, but you use train = df.values, so X includes the target column A.

    df_train = pd.DataFrame({
                    'X': range(20), 
                    'y': range(20),
               })
    
    X_train = df_train[ ['X'] ]  # it needs inner `[]` to create DataFrame, not Series
    y_train = df_train[  'y'  ]  # it has to be single column (Series)
    
    scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
    

    (BTW: you don't have to use .values; cross_val_score accepts a DataFrame and a Series directly.)

    The same with more columns:

    df_train = pd.DataFrame({
                    'A': range(20), 
                    'B': range(20), 
                    'y': range(20),
               })
    
    X_train = df_train[ ['A', 'B'] ]
    y_train = df_train[ 'y' ]
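
When the frame has many columns, listing the features by hand gets tedious. A possible alternative, assuming the target column is called 'y' as in the fake data above, is to drop the target and keep everything else:

```python
import pandas as pd

# same fake data as above
df_train = pd.DataFrame({
    'A': range(20),
    'B': range(20),
    'y': range(20),
})

X_train = df_train.drop(columns=['y'])  # every column except the target (DataFrame)
y_train = df_train['y']                 # the target only (Series)

print(list(X_train.columns))
```

drop returns a new DataFrame, so df_train itself still contains the 'y' column.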
    

    Minimal working code, with fake data (useless for real evaluation, and without missing values, so the imputer has nothing to fill in):

    # import libraries
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import BayesianRidge
    
    df_train = pd.DataFrame({
                    'A': range(100),  # fake data
                    'B': range(100),  # fake data
                    'y': range(100),  # fake data
               })
    
    df_test = pd.DataFrame({
                    'A': range(20),  # fake data
                    'B': range(20),  # fake data
                    'y': range(20),  # fake data
               })
    
    # define scaler, model and pipeline
    scaler  = StandardScaler()
    imputer = IterativeImputer()
    model   = BayesianRidge()
    
    pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
    
    X_train = df_train[ ['A', 'B'] ]  # it needs inner `[]` to create DataFrame, not Series
    y_train = df_train[ 'y' ]         # it has to be single column (Series)
    
    scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
    print(scores)
    
    X_test = df_test[['A', 'B']]
    y_test = df_test['y']
    
    scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
    print(scores)