Scikit-learn's IterativeImputer imputes missing values in a round-robin fashion. To compare its performance against other conventional regressors, one can build a simple pipeline and get scoring metrics from cross_val_score. The problem is that IterativeImputer does not have a 'predict' method, as the error shows:
AttributeError: 'IterativeImputer' object has no attribute 'predict'
Below is a minimal example of what is being attempted:
# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# define scaler, model and pipeline
scaler = StandardScaler() # use any scaler
imputer = IterativeImputer() # with any estimator, default = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])
train, test = df.values, df['A'].values
scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
print(scores)
What possible solutions exist? If a custom wrapper is needed, how should it be written to include the 'predict' method?
cross_val_score needs a pipeline with a model at the end (a step that has predict):
from sklearn.linear_model import BayesianRidge
scaler = StandardScaler()
imputer = IterativeImputer()
model = BayesianRidge() # any model
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
Running cross_val_score on a pipeline without a model makes no sense, because there are no predictions to score.
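As a quick check of the point above, here is a minimal sketch (the variable names without_model and with_model are made up for illustration). A Pipeline only exposes predict when its last step has one, so hasattr shows the difference without fitting anything:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline that ends with the imputer (a transformer, no predict)
without_model = Pipeline(steps=[('s', StandardScaler()), ('i', IterativeImputer())])

# Pipeline that ends with a regressor (has predict)
with_model = Pipeline(steps=[
    ('s', StandardScaler()),
    ('i', IterativeImputer()),
    ('m', BayesianRidge()),
])

print(hasattr(without_model, 'predict'))  # False
print(hasattr(with_model, 'predict'))     # True
```

So no custom wrapper is needed here; the imputer stays a preprocessing step and the final model supplies predict.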
I also see another problem, with the values train and test that you pass to cross_val_score. They should be called X and y instead of train and test, but those are only names, so that is not so important. What is important is what you assign to the variables. X should not contain y, but you use train = df.values, so you create X with y included.
df_train = pd.DataFrame({
'X': range(20),
'y': range(20),
})
X_train = df_train[ ['X'] ] # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ] # it has to be single column (Series)
scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
(BTW: you don't have to use .values)
The same with more columns:
df_train = pd.DataFrame({
'A': range(20),
'B': range(20),
'y': range(20),
})
X_train = df_train[ ['A', 'B'] ]
y_train = df_train[ 'y' ]
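When there are many feature columns, listing them by hand gets tedious. A minimal sketch of an alternative, assuming the target column is named 'y' as in the fake data above: drop the target and keep everything else as X.

```python
import pandas as pd

df_train = pd.DataFrame({
    'A': range(20),
    'B': range(20),
    'y': range(20),
})

# Every column except the target becomes a feature
X_train = df_train.drop(columns=['y'])
y_train = df_train['y']

print(list(X_train.columns))  # ['A', 'B']
```

This keeps y out of X no matter how many feature columns the frame has.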
Minimal working code, but with fake data (which is useless for a real evaluation):
# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import BayesianRidge
df_train = pd.DataFrame({
'A': range(100), # fake data
'B': range(100), # fake data
'y': range(100), # fake data
})
df_test = pd.DataFrame({
'A': range(20), # fake data
'B': range(20), # fake data
'y': range(20), # fake data
})
# define scaler, model and pipeline
scaler = StandardScaler()
imputer = IterativeImputer()
model = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
X_train = df_train[ ['A', 'B'] ] # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ] # it has to be single column (Series)
scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
print(scores)
X_test = df_test[['A', 'B']]
y_test = df_test['y']
scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
print(scores)
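The fake data above has no missing values, so the imputer has nothing to do. Here is a rough sketch of how the same pipeline could be used to compare different estimators inside IterativeImputer. Everything about the data is an assumption made for illustration: the synthetic columns, the roughly 10% missing rate, the random seed, and the choice of KNeighborsRegressor as the second estimator. The final model is kept the same so that only the imputation step changes.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up for illustration, not from the question)
rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=n)
B = A + rng.normal(scale=0.5, size=n)
y = pd.Series(2 * A - B + rng.normal(scale=0.1, size=n), name='y')

# Remove roughly 10% of the feature values; the target stays complete
features = np.column_stack([A, B])
features[rng.random(features.shape) < 0.1] = np.nan
X = pd.DataFrame(features, columns=['A', 'B'])

# Estimators used inside IterativeImputer
imputer_estimators = {
    'BayesianRidge': BayesianRidge(),
    'KNeighborsRegressor': KNeighborsRegressor(n_neighbors=5),
}

results = {}
for name, estimator in imputer_estimators.items():
    pipeline = Pipeline(steps=[
        ('s', StandardScaler()),
        ('i', IterativeImputer(estimator=estimator, random_state=0)),
        ('m', BayesianRidge()),  # same final model for every imputer
    ])
    results[name] = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    print(name, results[name].mean())
```

Because the imputer sits inside the pipeline, it is refit on the training part of every fold, so no information from the held-out part leaks into the imputation.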