
sklearn: Can't make OneHotEncoder work with Pipeline


I am building a pipeline for a model using ColumnTransformer. This is what my pipeline looks like:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

imputer_transformer = ColumnTransformer([
    ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
],remainder='passthrough')

category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler",MinMaxScaler(),[0,6]),
    ("owner_ordinal_enc",OrdinalEncoder(categories=[['fourth','third','second','first']],handle_unknown='ignore',dtype=np.int16),[3]),
    ("brand_location_ohe",OneHotEncoder(sparse=False,handle_unknown='ignore'),[2,5]),
],remainder='passthrough')


def build_pipeline_with_estimator(estimator):
    return Pipeline([
        ('imputer', imputer_transformer),
        ('category_transformer', category_transformer),
        ('estimator', estimator),
    ])

And this is what my dataset looks like:

kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5

This is how I am using LinearRegression with my pipeline.

linear_regressor = build_pipeline_with_estimator(LinearRegression())

linear_regressor.fit(X_train,y_train)

print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor,X_train,y_train))

print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor,X_test,y_test))

Whenever I fit linear regression with this pipeline, I get this error:

ValueError: could not convert string to float: 'bangalore'

'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but it is failing and I can't figure out what is going wrong here. Any help would be appreciated.


Solution

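  • After the imputer step, the columns that were not imputed are moved to the right, so the indices used in category_transformer no longer point at the columns you expect. This is described in the notes of the ColumnTransformer documentation: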

    Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

    We can try just using the imputer first:

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LinearRegression
    
    imputer_transformer = ColumnTransformer([
        ('knn_imputer',KNNImputer(n_neighbors=5),[0,3,4,6,7])
    ],remainder='passthrough')
    

    Trying it on some example data shows that the categorical columns are now shifted to the right:

    X_train = pd.DataFrame({
        'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
        'location': ['other', 'other', 'delhi'], 'mileage': [9, 8, np.nan],
        'power': [3, 2, 1], 'brand': ['A', 'B', 'C'],
        'engine': [10, 100, 1000], 'age': [3, 4, 5],
    })
    
    imputer_transformer.fit_transform(X_train)
    Out[25]: 
    array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
           [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
           [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)
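
    The new positions can also be worked out without running the transformer. The following is a minimal sketch, not part of scikit-learn: positions_after_passthrough is a hypothetical helper, and it assumes a single transformer that returns exactly one output column per selected column, followed by remainder='passthrough'. The column indices follow the dataset shown in the question.

```python
def positions_after_passthrough(n_columns, selected):
    """Map each original column index to its index in the transformed output.

    Assumption: one transformer that emits one output column per selected
    column, with all remaining columns passed through and appended on the right.
    """
    selected = list(selected)
    # Columns not handled by the transformer keep their relative order
    # and are appended after the transformed ones.
    remainder = [i for i in range(n_columns) if i not in selected]
    new_order = selected + remainder
    return {old: new for new, old in enumerate(new_order)}


# 8 input columns; the imputer handles columns 0, 3, 4, 6 and 7.
mapping = positions_after_passthrough(8, [0, 3, 4, 6, 7])
print(mapping)
```

    Looking up an original index in mapping gives the index to use in the next ColumnTransformer.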
    

    In your case, the engine column is now at index 3, the ordinal column (owner) is at index 5, and the two categorical columns are at indices 6 and 7, so a simple fix is to update the indices in the second transformer:

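    category_transformer = ColumnTransformer([
        ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 3]),
        # OrdinalEncoder only accepts 'error' or 'use_encoded_value' for
        # handle_unknown; unseen owner values are encoded as -1 here.
        ("owner_ordinal_enc", OrdinalEncoder(
            categories=[['fourth', 'third', 'second', 'first']],
            handle_unknown='use_encoded_value', unknown_value=-1,
            dtype=np.int16), [5]),
        # On scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False.
        ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [6, 7]),
    ], remainder='passthrough')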
    
    y_train = [7,3,2]
    
    linear_regressor = build_pipeline_with_estimator(LinearRegression())
    
    linear_regressor.fit(X_train,y_train)
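
    An alternative that avoids the index shift altogether is to use a single ColumnTransformer that selects columns by name and to chain imputation and scaling in a nested Pipeline. The sketch below is not the original setup: it assumes the input is a pandas DataFrame, it scales all numeric columns rather than only kms and engine, it leaves the one-hot output in its default format, and the step names and the numeric_columns list are chosen only for illustration. It also assumes scikit-learn 0.24 or newer for handle_unknown='use_encoded_value'.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Same toy data as in the example above.
X_train = pd.DataFrame({
    'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
    'location': ['other', 'other', 'delhi'], 'mileage': [9, 8, np.nan],
    'power': [3, 2, 1], 'brand': ['A', 'B', 'C'],
    'engine': [10, 100, 1000], 'age': [3, 4, 5],
})
y_train = [7, 3, 2]

numeric_columns = ['kms', 'mileage', 'power', 'engine', 'age']

# Impute, then scale, inside one sub-pipeline. No later step depends on the
# column order produced by an earlier ColumnTransformer.
numeric_pipeline = Pipeline([
    ('knn_imputer', KNNImputer(n_neighbors=5)),
    ('min_max_scaler', MinMaxScaler()),  # assumption: scale every numeric column
])

preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_columns),
    ('owner_ordinal_enc',
     OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']],
                    handle_unknown='use_encoded_value', unknown_value=-1),
     ['owner']),
    ('brand_location_ohe', OneHotEncoder(handle_unknown='ignore'),
     ['location', 'brand']),
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('estimator', LinearRegression()),
])
model.fit(X_train, y_train)
predictions = model.predict(X_train)
print(predictions.shape)
```

    Because every transformer receives its columns by name from the original DataFrame, reordering or adding columns upstream does not silently change which column gets encoded.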