python scikit-learn transformation scikit-learn-pipeline

Stacking up imputers in a pipeline

I've a question about stacking multiple sklearn SimpleImputers in a Pipeline:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  

pipeline = Pipeline([
    ('si1', SimpleImputer(missing_values = np.nan,
                          strategy='constant',
                          fill_value=-1)),
    ('si2', SimpleImputer(missing_values = None,
                          strategy='constant',
                          fill_value=-1))
])

train = pd.DataFrame({'f1': [True, 1, 0], 'f2': [None,None,None]})
test1 = pd.DataFrame({'f1': [0, False, 0], 'f2': [np.nan, np.nan, np.nan]})
test2 = pd.DataFrame({'f1': [0, 0, 0], 'f2': [np.nan, np.nan, np.nan]})

pipeline.fit_transform(train)
pipeline.transform(test1)
pipeline.transform(test2)

The code works fine for transforming test1 (which contains a Boolean value), but fails for test2 with:

ValueError: 'X' and 'missing_values' types are expected to be both numerical. Got X.dtype=float64 and type(missing_values)=<class 'NoneType'>.

Apparently, in the presence of a string or Boolean value, the transformation works fine, but it fails when there are only numerical values.

Another weird behavior is when I switch the order of the imputers inside the Pipeline:

pipeline = Pipeline([
    ('si2', SimpleImputer(missing_values = None,
                          strategy='constant',
                          fill_value=-1)),
    ('si1', SimpleImputer(missing_values = np.nan,
                          strategy='constant',
                          fill_value=-1))    
])

In this case, the transformations for test1 and test2 fail with the following errors, respectively:

ValueError: Input contains NaN

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I'm aware that these kinds of transformations can easily be done with the pandas.DataFrame.replace function. But I'm confused by the behavior and would appreciate an explanation of what's going on in each of these scenarios.

Solution

1. First issue - SimpleImputer raises a ValueError when the following condition is fulfilled (see documentation):

X.dtype.kind in ("f", "i", "u") and not isinstance(missing_values, numbers.Real)

For reference: isinstance(None, numbers.Real) returns False and isinstance(np.nan, numbers.Real) returns True.
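These two results are easy to confirm directly. The snippet below is only a quick check; the variable names are illustrative and not part of scikit-learn.

```python
import numbers

import numpy as np

# np.nan is an ordinary Python float, so it counts as a real number.
nan_is_real = isinstance(np.nan, numbers.Real)

# None is not a number of any kind.
none_is_real = isinstance(None, numbers.Real)

print(nan_is_real, none_is_real)
```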

The imputer 'si1' in your Pipeline always works, because not isinstance(np.nan, numbers.Real) is False, which makes the whole condition False regardless of the dtype.

Imputer 'si2' is the reason for the error: not isinstance(None, numbers.Real) is True, so everything depends on the dtype of the array you pass to the SimpleImputer. For test1, its dtype.kind is 'O' (object), but for test2, its dtype.kind is 'f' (float). Therefore the error condition is False for test1 but True for test2.
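To see the difference concretely, the condition can be evaluated by hand on both test frames. This is a rough sketch: dtype_check_fails is a hypothetical helper that only mirrors the condition quoted above; it is not a scikit-learn function.

```python
import numbers

import numpy as np
import pandas as pd


def dtype_check_fails(X, missing_values):
    # Hypothetical helper: mirrors the condition quoted in the answer.
    return X.dtype.kind in ("f", "i", "u") and not isinstance(
        missing_values, numbers.Real
    )


# Same frames as in the question, converted to NumPy arrays.
test1 = pd.DataFrame({'f1': [0, False, 0], 'f2': [np.nan, np.nan, np.nan]}).to_numpy()
test2 = pd.DataFrame({'f1': [0, 0, 0], 'f2': [np.nan, np.nan, np.nan]}).to_numpy()

# Mixing int and bool in 'f1' makes the whole array object-typed.
kinds = (test1.dtype.kind, test2.dtype.kind)

fails_test1 = dtype_check_fails(test1, None)
fails_test2 = dtype_check_fails(test2, None)
print(kinds, fails_test1, fails_test2)
```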

I'm not sure what the intention behind this condition is. A possible workaround is to apply the SimpleImputers separately instead of stacking them in a Pipeline, and to change the dtype of the first SimpleImputer's output (for example, to object) before passing it to the second one.
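A minimal sketch of that workaround might look like the following. The array X is made-up illustrative data, not the frames from the question, and the exact behavior may depend on your scikit-learn version, so treat this as an illustration rather than a definitive recipe.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative numeric data with one missing value (an assumption,
# not taken from the question).
X = np.array([[0.0, np.nan],
              [1.0, 2.0]])

si1 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
si2 = SimpleImputer(missing_values=None, strategy='constant', fill_value=-1)

# First imputer: replaces NaN with -1, the output stays float64.
step1 = si1.fit_transform(X)

# Passing the float array straight to si2 triggers the dtype check.
try:
    si2.fit_transform(step1)
    failed_without_cast = False
except ValueError:
    failed_without_cast = True

# Casting to object first makes dtype.kind 'O', so the check passes.
result = si2.fit_transform(step1.astype(object))
print(failed_without_cast, result.dtype)
```

The cast to object is what avoids the error here; whether an object array is acceptable downstream depends on the rest of your processing.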

2. Second issue - if an array you want to pass to SimpleImputer contains np.nan values, SimpleImputer needs np.nan as its missing_values. Otherwise it will raise an error (see documentation).

Let's consider test1 only, the same applies to test2.

a) Consider order:

pipeline = Pipeline([
    ('si1', SimpleImputer(missing_values = np.nan,
                          strategy='constant',
                          fill_value=-1)),
    ('si2', SimpleImputer(missing_values = None,
                          strategy='constant',
                          fill_value=-1))
])

When test1 is passed to 'si1', it imputes np.nans and the result is

array([[0, -1],
       [False, -1],
       [0, -1]], dtype=object)

Then this array is passed to 'si2' and the result is

array([[0, -1],
       [False, -1],
       [0, -1]], dtype=object)

and it's all fine.

b) Now consider reverse order:

pipeline = Pipeline([
    ('si2', SimpleImputer(missing_values = None,
                          strategy='constant',
                          fill_value=-1)),
    ('si1', SimpleImputer(missing_values = np.nan,
                          strategy='constant',
                          fill_value=-1))
])

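test1 contains np.nan values, but 'si2' now comes first and only treats None as missing, so it does not impute them. Its input validation therefore rejects the NaNs and raises a ValueError, as expected.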
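This can be reproduced without a Pipeline. The sketch below is a simplification: it calls fit_transform directly on small made-up arrays (one object-typed like test1, one float-typed like test2) instead of fitting on train first. The error messages differ between scikit-learn versions, so they are only collected, not compared.

```python
import numpy as np
from sklearn.impute import SimpleImputer

si2 = SimpleImputer(missing_values=None, strategy='constant', fill_value=-1)

# Illustrative inputs that still contain NaN when they reach 'si2'.
inputs = {
    'object_like_test1': np.array([[0, np.nan], [False, np.nan]], dtype=object),
    'float_like_test2': np.array([[0.0, np.nan], [0.0, np.nan]]),
}

errors = {}
for name, X in inputs.items():
    try:
        si2.fit_transform(X)
        errors[name] = None
    except ValueError as exc:
        # Keep the message for inspection; its wording is version-dependent.
        errors[name] = str(exc)

for name, message in errors.items():
    print(name, '->', message)
```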