Tags: python, scikit-learn, pipeline, one-hot-encoding

Applying different transformations to two object-type columns in a sklearn Pipeline


I am trying to apply two different transformations from sklearn to two different columns, both of which are object dtype, inside my Pipeline. My DataFrame looks like this (I'm only showing a few rows to illustrate the point):

             email  country  label 
[email protected]       NI   True 
[email protected]       AR  False
[email protected]       CZ   True

Both email and country are of object dtype.

For email I created a bunch of functions to transform it into numeric representations, like:

import numpy as np

def email_length(email) -> np.ndarray:
    # length of the part before the '@'
    return np.array([len(e[0].split('@')[0]) for e in email]).reshape(-1, 1)

def domain_length(email) -> np.ndarray:
    # length of the part after the '@'
    return np.array([len(e[0].split('@')[-1]) for e in email]).reshape(-1, 1)

def number_of_vouls(email) -> np.ndarray:
    # number of vowels in the part before the '@'
    vouls = 'aeiouAEIOU'
    names = [e[0].split('@')[0] for e in email]
    return np.array([sum(1 for char in name if char in vouls) for name in names]).reshape(-1, 1)
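
These functions expect a 2-D array (one column of email strings), which is why each row is indexed with e[0]; for example, with a made-up address:

emails = np.array([['alice@example.com']])  # hypothetical address, shape (1, 1)
email_length(emails)      # array([[5]])   len('alice')
domain_length(emails)     # array([[11]])  len('example.com')
number_of_vouls(emails)   # array([[3]])   'a', 'i', 'e' in 'alice'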

To apply these functions to the email column in a sklearn Pipeline, I was using FunctionTransformer and FeatureUnion like this:

from lightgbm import LGBMClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

get_email_length = FunctionTransformer(email_length)
get_domain_length = FunctionTransformer(domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = FeatureUnion([
        ('email_length', get_email_length),
        ('domain_length', get_domain_length),
        ('number_of_vouls', get_number_of_vouls)])

pipe = Pipeline([
        ('preproc', preproc),
        ('classifier', LGBMClassifier())
        ])

But I also want to apply a one-hot encoder to country inside my Pipeline. What would be the best way to do that, given this Pipeline definition?


Solution

  • You could try ColumnTransformer:

    1. With DataFrame input:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


    def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
        # two columns: length of the local part and length of the domain
        return df["email"].str.split("@", expand=True).applymap(len)


    def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
        # count the vowels in the local part (before the '@')
        return (
            df["email"]
            .str.split("@")
            .str[0]
            .str.lower()
            .apply(lambda x: sum(x.count(v) for v in "aeiou"))
            .to_frame()
        )


    get_email_length = FunctionTransformer(email_and_domain_length)
    get_number_of_vouls = FunctionTransformer(number_of_vouls)

    preproc = ColumnTransformer(
        [
            ("lengths", get_email_length, ["email"]),
            ("vouls", get_number_of_vouls, ["email"]),
            ("countries", OneHotEncoder(), ["country"]),
        ]
    )
    preproc.fit_transform(df[["email", "country"]])
    

    2. With ndarray input:

    Just add this to the code in your question; your functions already work with ndarray input.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # column 0 is email, column 1 is country
    preproc = ColumnTransformer(
        [
            ("email_lengths", get_email_length, [0]),
            ("domain_lengths", get_domain_length, [0]),
            ("vouls", get_number_of_vouls, [0]),
            ("countries", OneHotEncoder(), [1]),
        ]
    )
    preproc.fit_transform(df[["email", "country"]].to_numpy())
    

    Output:

    array([[8., 9., 4., 0., 0., 1.],
           [8., 9., 3., 1., 0., 0.],
           [8., 9., 0., 0., 1., 0.]])
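
    The ColumnTransformer then simply replaces the FeatureUnion step in your Pipeline. A minimal sketch, assuming the DataFrame-input preproc above and the LGBMClassifier from your question:

    from lightgbm import LGBMClassifier
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("preproc", preproc),
        ("classifier", LGBMClassifier()),
    ])
    pipe.fit(df[["email", "country"]], df["label"])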
    

    As an aside, one-hot encoding would cause more harm than good if country has high cardinality.
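
    If it does, one option is to cap or group the rare categories rather than dropping one-hot encoding entirely. A minimal sketch, assuming scikit-learn >= 1.1 (which added min_frequency and max_categories to OneHotEncoder); the thresholds are just placeholders:

    from sklearn.preprocessing import OneHotEncoder

    # lump countries seen fewer than 10 times into one "infrequent" bucket,
    # keep at most 20 categories, and send unseen countries to that bucket
    ohe = OneHotEncoder(
        min_frequency=10,
        max_categories=20,
        handle_unknown="infrequent_if_exist",
    )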

    I've also tried to vectorize the preprocessing functions by using .str accessor methods instead of list comprehensions.