I want to incorporate label encoding through the scikit-learn pipeline. Unfortunately, LabelEncoder() doesn't work with the pipeline API, so that's not an option right now. I tried creating my own class which calls .map() to map categories to labels:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class RatingEncoder(BaseEstimator, TransformerMixin):
    """Takes in a dataframe, converts all categorical Ratings columns into
    numerical Ratings columns via label-encoding"""

    def __init__(self):
        pass

    def fit(self, df, y=None):
        return self

    def transform(self, df, y=None):
        """Transform all of the categorical ratings columns into numerical ratings columns"""
        for feature in df.columns:
            df[feature] = df[feature].map({
                "Po": 1,
                "Fa": 2,
                "TA": 3,
                "Gd": 4,
                "Ex": 5,
            })
        return df
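On its own this transformer does what I expect; here is a quick sanity check on a made-up DataFrame (the column names are just examples):

import pandas as pd

toy = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex"],
                    "KitchenQual": ["TA", "Fa", "Po"]})
print(RatingEncoder().fit_transform(toy))
# ExterQual -> [4, 3, 5], KitchenQual -> [3, 2, 1]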
Then, I set up the following pipeline:
import numpy as np

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

def select_numeric_features(df):
    return df.select_dtypes(include=np.number).columns

def select_categorical_features(df):
    return df.select_dtypes(exclude=np.number).columns

def select_rated_features(df):
    rated_features = []
    for column in df:
        # This criterion determines whether a column is a 'rated' column
        if any(df[column] == 'TA'):
            rated_features.append(column)
    return rated_features

pipeline = make_column_transformer(
    (RatingEncoder(), select_rated_features),
    (SimpleImputer(strategy='constant', fill_value='None'), select_categorical_features),
    (SimpleImputer(strategy='constant', fill_value=0), select_numeric_features),
    remainder='passthrough'
)
The problem with this is that after the RatingEncoder() step, the categorical 'ratings' columns are supposed to become numerical columns. However, this change doesn't show up in the column-selection part of the column transformer, so select_numeric_features and select_categorical_features will still pick up the 'ratings' columns as if they had not been mapped from categories to values. Basically, the column transformer isn't using columns that were updated in the middle of the pipeline. Is there any workaround for this? Or is there a simpler solution for label encoding using the pipeline API?
LabelEncoder is meant to encode labels, i.e. the y (or target). If you want to encode the data (i.e. X), you can use a OneHotEncoder or an OrdinalEncoder, both of which integrate easily into a Pipeline from scikit-learn.
In your case it seems that you want to ordinal-encode your data.
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder()
)

preprocessor.fit_transform(X_train)
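Note that, by default, OrdinalEncoder infers the categories from the training data in sorted (lexicographic) order. If the order of the ratings matters, as in your Po < Fa < TA < Gd < Ex mapping, you can pass it explicitly via the categories parameter; the column names below are only placeholders for however you select the rated columns:

# Sketch assuming the rated columns use the Po/Fa/TA/Gd/Ex scale;
# "missing" is listed first because the imputer introduces it.
rated_columns = ["ExterQual", "KitchenQual"]  # placeholder column names
rating_scale = ["missing", "Po", "Fa", "TA", "Gd", "Ex"]

ordered_preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(categories=[rating_scale] * len(rated_columns))
)
ordered_preprocessor.fit_transform(X_train[rated_columns])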
A more complete example can be found here: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
In that example, you could use an OrdinalEncoder instead of the OneHotEncoder if the classifier is not a linear model (e.g. RandomForestClassifier).
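As a rough sketch of that variant (the column lists are placeholders for your own selection logic):

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_columns = ["ExterQual", "KitchenQual"]  # placeholder names
numeric_columns = ["LotArea", "GrLivArea"]          # placeholder names

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OrdinalEncoder()),
     categorical_columns),
    (SimpleImputer(strategy="constant", fill_value=0), numeric_columns),
    remainder="drop"
)

model = make_pipeline(preprocessor, RandomForestClassifier())
# model.fit(X_train, y_train)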