Scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')" even with a pipeline


I'm trying to use SequentialFeatureSelector, and for the estimator parameter I'm passing a pipeline that includes a step that imputes the missing values:

model = Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value=-1,
                                                                                 strategy='constant')),
                                                                  ('preprocessing',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoding',
                                                                   OrdinalEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
                ('model',
                 LGBMClassifier(class_weight='balanced', random_state=1,
                                reg_lambda=0.1))])

Nonetheless, passing this to the selector raises an error, which makes no sense to me, since I have already fit and evaluated the model on its own and it runs fine:

fselector = SequentialFeatureSelector(estimator=model, scoring="roc_auc", cv=3, n_jobs=-1).fit(X, target)




 _assert_all_finite(X, allow_nan, msg_dtype)
        101                 not allow_nan and not np.isfinite(X).all()):
        102             type_err = 'infinity' if allow_nan else 'NaN, infinity'
    --> 103             raise ValueError(
        104                     msg_err.format
        105                     (type_err,
    
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

EDIT:

Reproducible example:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # np.nan rather than np.NaN (the alias was removed in NumPy 2.0)

# The pipeline imputes the missing values before the classifier sees them
clf = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                ("model", LogisticRegression(random_state=1))])

SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)

It shows the same error, even though clf itself can be fit without problems.


Solution

  • scikit-learn's documentation does not state that SequentialFeatureSelector works with Pipeline objects; it only says that the class accepts an unfitted estimator. Given that, you could remove the classifier from your pipeline, preprocess X, and then pass the result, along with an unfitted classifier, to the feature selector, as shown in the example below.

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MaxAbsScaler
    
    
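    X, y = load_iris(return_X_y=True)
    X[:10, 0] = np.nan  # np.nan rather than np.NaN (the alias was removed in NumPy 2.0)
    
    # Preprocessing only: the classifier is deliberately left out of this pipeline
    pipe = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                     ("scaler", MaxAbsScaler())])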
    
    
    # Preprocess your data
    X = pipe.fit_transform(X)
    
    # Run the SequentialFeatureSelector
    sfs = SequentialFeatureSelector(estimator = LogisticRegression(),
                               scoring= "accuracy",
                               cv = 3).fit(X, y)
    
    # Check which features are important and transform X
    sfs.get_support()
    X = sfs.transform(X)
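
    If you also want to know which columns were kept and then train a final model on them, something along these lines might help. It is only a sketch building on the example above: n_features_to_select=2 and max_iter=1000 are assumptions made to keep the demo small and quiet, and the names X_clean, selected_idx and final_model are purely illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

data = load_iris()
X, y = data.data.copy(), data.target
X[:10, 0] = np.nan  # same artificial gaps as in the question

# Preprocess outside the selector so that it never receives NaN
pipe = Pipeline([("imputer", SimpleImputer()), ("scaler", MaxAbsScaler())])
X_clean = pipe.fit_transform(X)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),  # assumption: raised only to avoid convergence warnings
    n_features_to_select=2,             # assumption: fixed so the demo keeps two features
    scoring="accuracy",
    cv=3,
).fit(X_clean, y)

# Map the boolean mask back to column indices and feature names
selected_idx = sfs.get_support(indices=True)
selected_names = [data.feature_names[i] for i in selected_idx]
print(selected_names)

# Refit on the raw (still NaN-containing) selected columns; here the pipeline
# can include the imputer again because no selector is validating the input
final_model = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", MaxAbsScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X[:, selected_idx], y)
```

    Using get_support(indices=True) returns integer positions instead of a boolean mask, which is convenient for slicing the original array. Note that the imputer and scaler in this sketch are fit on the full data before cross-validation, the same as in the example above.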