Tags: python, pandas, scikit-learn, pipeline, sklearn-pandas

How do I Label Encode using the Pipeline API?


I want to incorporate label encoding into a scikit-learn pipeline. Unfortunately, LabelEncoder() does not work with the pipeline API, so that's not an option right now. I tried creating my own class, which calls .map() to map categories to labels:

from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class RatingsTransformer(BaseEstimator, TransformerMixin):
    """Takes in a dataframe and converts all categorical ratings columns into
    numerical ratings columns via label encoding."""

    def __init__(self):
        pass

    def fit(self, df, y=None):
        # Nothing to learn: the category-to-number mapping is fixed.
        return self

    def transform(self, df, y=None):
        """Transform all of the categorical ratings columns into numerical ratings columns."""
        df = df.copy()  # avoid mutating the caller's dataframe
        for feature in df.columns:
            df[feature] = df[feature].map({
                "Po": 1,
                "Fa": 2,
                "TA": 3,
                "Gd": 4,
                "Ex": 5,
            })
        return df

Then, I set up the following pipeline:

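import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

def select_numeric_features(df):
    return df.select_dtypes(include=np.number).columns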

def select_categorical_features(df):
    return df.select_dtypes(exclude=np.number).columns

def select_rated_features(df):
    rated_features = []
    for column in df:
        # This criterion determines whether a column is a 'rated column'
        if any(df[column] == 'TA'):
            rated_features.append(column)
    return rated_features

pipeline = make_column_transformer(
    (RatingsTransformer(), select_rated_features),
    (SimpleImputer(strategy='constant', fill_value='None'), select_categorical_features),
    (SimpleImputer(strategy='constant', fill_value=0), select_numeric_features),
    remainder='passthrough'
)

The problem is that after the RatingsTransformer() step, the categorical 'ratings' columns are supposed to become numerical columns. However, this change isn't visible to the column selection in the column transformer, so select_numeric_features and select_categorical_features pick up the 'ratings' columns as if they had not been mapped from categories to values. Basically, the column transformer isn't using columns that were updated in the middle of the pipeline. Is there a workaround for this? Or is there a simpler way to label encode using the pipeline API?


Solution

  • LabelEncoder is meant to encode labels, i.e. the target y. To encode the data (i.e. X), you can use a OneHotEncoder or an OrdinalEncoder, both of which integrate easily into a scikit-learn Pipeline.

    In your case it seems that you want to ordinal encode your data.

    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OrdinalEncoder
    
    preprocessor = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OrdinalEncoder()
    )
    
    preprocessor.fit_transform(X_train)
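
    One caveat worth sketching: with its default settings, OrdinalEncoder assigns codes in sorted (alphabetical) category order, which would not match the Po < Fa < TA < Gd < Ex ordering from the question. A minimal sketch, assuming the rating levels from the question and a made-up toy frame (the column names ExterQual and KitchenQual are hypothetical), passes the order explicitly through the categories parameter:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Rating levels from the question, worst to best.
# "missing" comes first because it is the imputer's fill value.
rating_order = ["missing", "Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical toy data with two rated columns and some missing values.
X_toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", np.nan, "Ex"],
    "KitchenQual": ["Po", "Fa", "TA", np.nan],
})

rating_encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    # One category list per input column, so every column shares the same order.
    OrdinalEncoder(categories=[rating_order] * X_toy.shape[1]),
)

# Each value is replaced by its position in rating_order.
encoded = rating_encoder.fit_transform(X_toy)
print(encoded)
```

    With this ordering, "missing" maps to 0 and "Po" through "Ex" map to 1 through 5, which mirrors the .map() dictionary in the question.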
    

    A more complete example can be found here: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

    You could use an OrdinalEncoder instead of the OneHotEncoder if the classifier is not a linear model (e.g. a RandomForestClassifier).
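
    On the column-selection problem in the question: a ColumnTransformer applies each of its transformers to the original input columns and concatenates the results, so one transformer never sees another's output. One possible workaround is to give each column group its own Pipeline that imputes and then encodes. The sketch below is only an illustration under assumptions: the toy frame and its column names are made up, the selector helpers are hypothetical variations on the ones in the question, and the column lists are computed up front so that the groups do not overlap.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Rating levels from the question; "None" is the fill value used for missing ratings.
RATING_ORDER = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

def select_rated_features(df):
    # Same criterion as the question: a non-numeric column containing 'TA'.
    candidates = df.select_dtypes(exclude=np.number).columns
    return [col for col in candidates if (df[col] == "TA").any()]

# Hypothetical toy data: one rated, one plain categorical, one numeric column.
X_toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", np.nan, "Ex"],
    "Street": ["Pave", "Grvl", "Pave", np.nan],
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
})

# Compute non-overlapping column groups before building the transformer.
rated_cols = select_rated_features(X_toy)
other_cat_cols = [c for c in X_toy.select_dtypes(exclude=np.number).columns
                  if c not in rated_cols]
numeric_cols = list(X_toy.select_dtypes(include=np.number).columns)

preprocessor = make_column_transformer(
    # Rated columns: impute, then ordinal-encode with the explicit order.
    (make_pipeline(
        SimpleImputer(strategy="constant", fill_value="None"),
        OrdinalEncoder(categories=[RATING_ORDER] * len(rated_cols)),
    ), rated_cols),
    # Other categorical columns: impute, then one-hot encode.
    (make_pipeline(
        SimpleImputer(strategy="constant", fill_value="None"),
        OneHotEncoder(handle_unknown="ignore"),
    ), other_cat_cols),
    # Numeric columns: impute only.
    (SimpleImputer(strategy="constant", fill_value=0), numeric_cols),
    sparse_threshold=0,  # always return a dense array, for readability
)

X_out = preprocessor.fit_transform(X_toy)
print(X_out)
```

    Because the column lists are derived from one particular dataframe, this sketch assumes later data has the same columns. Whether one-hot or ordinal encoding is appropriate for the remaining categorical columns depends on the downstream model, as noted above.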