python machine-learning scikit-learn feature-engineering

feature-engine: cross-validation gives error when wrapping OneHotEncoder in SklearnTransformerWrapper

Issue

I am using the feature-engine library. When I create an sklearn Pipeline that uses SklearnTransformerWrapper to wrap a OneHotEncoder, I get the following error when running cross-validation:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
...
9 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.

Below are more details about the failures:
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

If I do things the "old way" with an sklearn ColumnTransformer, I do not get the error. I also don't get errors if I either: A) Score without cross-validation or B) Don't use the categorical features (i.e. remove the one-hot encoding).

Is this an issue with SklearnTransformerWrapper or am I using it the wrong way?

Code

Here is the Pipeline setup with SklearnTransformerWrapper that fails. It will work successfully if we don't use the categorical features, or if we don't do cross-validation (see comments in code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.selection import DropFeatures


pipeline_new = Pipeline(steps=[
    ("scale_b_c", SklearnTransformerWrapper(
            transformer=StandardScaler(), 
            variables=["b", "c"]
        )
    ),
    
    # Comment out this step for cross-validation to not fail
    ("encode_a_d", SklearnTransformerWrapper(
            # On scikit-learn >= 1.2, the sparse argument is named sparse_output
            transformer=OneHotEncoder(drop="first", sparse=False),
            variables=["a", "d"]
        )
    ),
    
    ("cleanup", DropFeatures(["a", "d"])),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
# Set cv to False to successfully score entire training set
do_test(df, pipeline_new, cv=True)

Here is the "old-style" pipeline that uses ColumnTransformer instead; it works correctly:

from sklearn.compose import ColumnTransformer


pipeline_old = Pipeline(steps=[
    (
        "xform", ColumnTransformer([
            ("cat", OneHotEncoder(drop="first"), ["a", "d"]),
            ("num", StandardScaler(), ["b", "c"])
        ])
    ),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
do_test(df, pipeline_old, cv=True)

Supporting code: implementation of the do_test() test function:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# do_test() implementation
def do_test(df, pipeline, cv=True):
    X = df.drop(columns=["y"])
    y = df[["y"]]
       
    if cv:
        return cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
    else:
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X)        
        return mean_squared_error(y, y_pred)

Supporting code: sample data creation.

import pandas as pd
import numpy as np

# Create sample data
n = 20000
df = pd.DataFrame({
    "a": [["alpha", "beta", "gamma", "delta"][np.random.randint(4)] for i in range(n)],
    "b": [np.random.random() * 100 for i in range(n)],
    "c": [np.random.random() * 200 for i in range(n)],
    "d": [["east", "west"][np.random.randint(2)] for i in range(n)],
})

def make_y(x):
    add_1 = 100 if x.a in ["alpha", "beta"] else 200
    add_2 = 100 if x.d in ["east"] else 300

    return 2 * x.b + 3 * x.c + 2 * add_1 + 5 * add_2 + np.random.normal(10)

df["y"] = df.apply(make_y, axis=1)

Note: I am not doing train/test separation, in order to keep the question shorter.

Solution

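The output that @AlwaysRightNeverLeft describes suggests an index problem. During cross-validation, each fold is a row subset of the original dataframe, so its index is generally no longer the default 0..n-1. When SklearnTransformerWrapper merges the one-hot encoded array back into the original data, it does an "outer join" on the index. The encoded rows presumably carry a fresh default index, so their labels do not match the fold's labels, and the unmatched cells are filled with NaN, which LinearRegression then rejects. This would also be consistent with 9 of the 10 fits failing: with the default unshuffled 10-fold split, only the last fold's training rows are still labeled 0..n-1.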
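
To see the effect in isolation, here is a minimal sketch using only pandas and numpy. It does not call feature-engine at all. The frame X_fold and the array encoded_values are made-up stand-ins for a cross-validation fold and the encoder output, and the concat call only imitates an index-aligned outer join; it is an assumption that the wrapper does something equivalent internally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a cross-validation fold: a row subset of a larger
# frame keeps the original row labels, so the index is 2, 3, 4 rather than 0, 1, 2.
X_fold = pd.DataFrame(
    {"a": ["alpha", "beta", "alpha"], "b": [1.0, 2.0, 3.0]},
    index=[2, 3, 4],
)

# Hypothetical stand-in for the encoder output: a bare array with no index.
encoded_values = np.array([[0.0], [1.0], [0.0]])

# Wrapping the array without an index gives it the default labels 0, 1, 2.
encoded_default = pd.DataFrame(encoded_values, columns=["a_beta"])

# Column-wise concat aligns on the index (outer join by default), so the
# result has the union of both label sets (5 rows instead of 3), padded with NaN.
misaligned = pd.concat([X_fold, encoded_default], axis=1)

# Reusing the fold's index keeps every encoded row next to its source row.
encoded_aligned = pd.DataFrame(
    encoded_values, columns=["a_beta"], index=X_fold.index
)
aligned = pd.concat([X_fold, encoded_aligned], axis=1)

print(misaligned)
print(aligned)
```

If this is indeed what happens inside the wrapper, the NaN values appear only when the input index differs from 0..n-1, which matches the observation that fitting on the whole dataframe works while cross-validation does not.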