I am building a pipeline for a model using ColumnTransformer. This is what my pipeline looks like:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer

imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')

category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 6]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']], handle_unknown='ignore', dtype=np.int16), [3]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [2, 5]),
], remainder='passthrough')

def build_pipeline_with_estimator(estimator):
    return Pipeline([
        ('imputer', imputer_transformer),
        ('category_transformer', category_transformer),
        ('estimator', estimator),
    ])
And this is what my dataset looks like:
kms_driven  owner  location  mileage  power  brand          engine  age
34000.0     first  other     NaN      12.0   Yamaha         150.0   9
28000.0     first  other     72.0     7.0    Hero           100.0   16
5947.0      first  other     53.0     19.0   Bajaj          NaN     4
11000.0     first  delhi     40.0     19.8   Royal Enfield  350.0   7
13568.0     first  delhi     63.0     14.0   Suzuki         150.0   5
This is how I am using LinearRegression with my pipeline.
linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train,y_train)
print('Linear Regression Train Performance.\n')
print(model_perf(linear_regressor,X_train,y_train))
print('Linear Regression Test Performance.\n')
print(model_perf(linear_regressor,X_test,y_test))
Now, whenever I try to apply linear regression with the pipeline I get this error,
ValueError: could not convert string to float: 'bangalore'
'bangalore' is one of the values in the location feature, which I am trying to one-hot encode, but it is failing and I can't figure out what is going wrong here. Any help would be appreciated.
After the data passes through the imputer, the columns that were not imputed are moved to the right, as stated in the Notes section of the ColumnTransformer documentation:
Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
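As a rough illustration of that rule, the new column order can be worked out by hand before running anything. The helper `passthrough_order` below is hypothetical (it is not part of scikit-learn) and assumes a single transformer that emits exactly one output column per selected input column, which holds for KNNImputer as long as no selected column is entirely missing:

```python
def passthrough_order(n_columns, selected):
    """Return the original column indices in the order they would appear
    after one transformer with remainder='passthrough':
    selected columns first, then the untouched ones in their original order."""
    remainder = [i for i in range(n_columns) if i not in selected]
    return list(selected) + remainder


# Column names taken from the dataset shown in the question.
columns = ['kms_driven', 'owner', 'location', 'mileage',
           'power', 'brand', 'engine', 'age']

order = passthrough_order(len(columns), [0, 3, 4, 6, 7])

# Map each column name to its position after the imputer step.
new_positions = {columns[old]: new for new, old in enumerate(order)}

print(order)
print(new_positions)
```

Under these assumptions, `new_positions` gives the indices that any later ColumnTransformer has to use.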
We can try just using the imputer first:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

imputer_transformer = ColumnTransformer([
    ('knn_imputer', KNNImputer(n_neighbors=5), [0, 3, 4, 6, 7])
], remainder='passthrough')
Running it on some example data shows that your categorical columns are now shifted to the right:
X_train = pd.DataFrame({'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
                        'location': ['other', 'other', 'delhi'], 'mileage': [9, 8, np.nan],
                        'power': [3, 2, 1], 'brand': ['A', 'B', 'C'],
                        'engine': [10, 100, 1000], 'age': [3, 4, 5]})

imputer_transformer.fit_transform(X_train)

Out[25]:
array([[0.0, 9.0, 3.0, 10.0, 3.0, 'first', 'other', 'A'],
       [1.0, 8.0, 2.0, 100.0, 4.0, 'first', 'other', 'B'],
       [2.0, 8.5, 1.0, 1000.0, 5.0, 'second', 'delhi', 'C']], dtype=object)
In your case, the engine column is now the fourth column (index 3), the ordinal owner column is the sixth (index 5), and the two categorical columns, location and brand, are the last two (indices 6 and 7), so a simple solution might be:
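# Indices below refer to the column order produced by the imputer step.
# Note: on scikit-learn >= 1.2, OneHotEncoder takes sparse_output=False instead of
# sparse=False, and OrdinalEncoder only accepts handle_unknown='error' or
# 'use_encoded_value', so recent versions reject 'ignore' there.
category_transformer = ColumnTransformer([
    ("kms_driven_engine_min_max_scaler", MinMaxScaler(), [0, 3]),
    ("owner_ordinal_enc", OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']],
                                         handle_unknown='ignore', dtype=np.int16), [5]),
    ("brand_location_ohe", OneHotEncoder(sparse=False, handle_unknown='ignore'), [6, 7]),
], remainder='passthrough')

y_train = [7, 3, 2]
linear_regressor = build_pipeline_with_estimator(LinearRegression())
linear_regressor.fit(X_train, y_train)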
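If keeping track of shifted indices feels fragile, one possible alternative is to nest the imputer inside a single ColumnTransformer and select columns by name, so the categorical columns never pass through the imputer at all. This is only a sketch, not the layout from the question: the column names come from the toy frame above, the step names are made up for illustration, and it leaves out `sparse=False` and the ordinal encoder's `handle_unknown`, whose accepted values differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Numeric columns, imputed jointly as in the original imputer step.
numeric_cols = ['kms', 'mileage', 'power', 'engine', 'age']

# Inside this sub-pipeline the imputer returns a plain array whose columns
# follow numeric_cols, so 'kms' is index 0 and 'engine' is index 3.
numeric_pipeline = Pipeline([
    ('knn_imputer', KNNImputer(n_neighbors=5)),
    ('min_max', ColumnTransformer(
        [('kms_engine_scaler', MinMaxScaler(), [0, 3])],
        remainder='passthrough')),
])

# Categorical columns are picked by name from the original DataFrame,
# so their position after imputation no longer matters.
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_cols),
    ('owner_ordinal_enc',
     OrdinalEncoder(categories=[['fourth', 'third', 'second', 'first']]),
     ['owner']),
    ('brand_location_ohe', OneHotEncoder(handle_unknown='ignore'),
     ['location', 'brand']),
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', LinearRegression()),
])

# Same toy data as in the example above.
X_train = pd.DataFrame({'kms': [0, 1, 2], 'owner': ['first', 'first', 'second'],
                        'location': ['other', 'other', 'delhi'],
                        'mileage': [9, 8, np.nan], 'power': [3, 2, 1],
                        'brand': ['A', 'B', 'C'], 'engine': [10, 100, 1000],
                        'age': [3, 4, 5]})
y_train = [7, 3, 2]

model.fit(X_train, y_train)
predictions = model.predict(X_train)
```

Selecting by name requires passing a DataFrame to `fit` and `predict`; with plain arrays the index-based fix above is still needed.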