Tags: python, scikit-learn, pipeline, one-hot-encoding

Applying different transformations to two object-type columns in a sklearn Pipeline


I am trying to apply two different transformations from sklearn to two different columns, both of which are object dtype, inside my Pipeline. My DataFrame looks like this (I'm only showing a few rows to illustrate the point):

             email  country  label 
[email protected]       NI   True 
[email protected]       AR  False
[email protected]       CZ   True

Both email and country are of object dtype.

For email I created a bunch of functions to transform it into numeric representations, like:

import numpy as np

def email_length(email) -> np.ndarray:
    # length of the part before the '@'
    return np.array([len(e[0].split('@')[0]) for e in email]).reshape(-1, 1)

def domain_length(email) -> np.ndarray:
    # length of the part after the '@'
    return np.array([len(e[0].split('@')[-1]) for e in email]).reshape(-1, 1)

def number_of_vouls(email) -> np.ndarray:
    # number of vowels in the part before the '@'
    vouls = 'aeiouAEIOU'
    names = [e[0].split('@')[0] for e in email]
    return np.array([sum(1 for char in name if char in vouls) for name in names]).reshape(-1, 1)
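
These functions expect a 2-D array (one column of email strings), which is why each row is indexed with e[0]; for example, with a made-up address:

emails = np.array([['alice@example.com']])  # hypothetical address, shape (1, 1)
email_length(emails)      # array([[5]])   len('alice')
domain_length(emails)     # array([[11]])  len('example.com')
number_of_vouls(emails)   # array([[3]])   'a', 'i', 'e' in 'alice'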

To apply these functions to the email column in a sklearn Pipeline, I was using FunctionTransformer and FeatureUnion like this:

from lightgbm import LGBMClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

get_email_length = FunctionTransformer(email_length)
get_domain_length = FunctionTransformer(domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = FeatureUnion([
        ('email_length', get_email_length),
        ('domain_length', get_domain_length),
        ('number_of_vouls', get_number_of_vouls)])

pipe = Pipeline([
        ('preproc', preproc),
        ('classifier', LGBMClassifier())
        ])

But I also want to apply a one-hot encoder to country inside my Pipeline. What would be the best way to do that, given this Pipeline definition?


Solution

  • You could try ColumnTransformer:

    1. With DataFrame input:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


    def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
        # two columns: length of the local part and length of the domain
        return df["email"].str.split("@", expand=True).applymap(len)


    def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
        # count the vowels in the local part (before the '@')
        return (
            df["email"]
            .str.split("@")
            .str[0]
            .str.lower()
            .apply(lambda x: sum(x.count(v) for v in "aeiou"))
            .to_frame()
        )


    get_email_length = FunctionTransformer(email_and_domain_length)
    get_number_of_vouls = FunctionTransformer(number_of_vouls)

    preproc = ColumnTransformer(
        [
            ("lengths", get_email_length, ["email"]),
            ("vouls", get_number_of_vouls, ["email"]),
            ("countries", OneHotEncoder(), ["country"]),
        ]
    )
    preproc.fit_transform(df[["email", "country"]])
    

    2. With ndarray input:

    Just add this to the code in your question; your functions already work with ndarray input.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # column 0 is email, column 1 is country
    preproc = ColumnTransformer(
        [
            ("email_lengths", get_email_length, [0]),
            ("domain_lengths", get_domain_length, [0]),
            ("vouls", get_number_of_vouls, [0]),
            ("countries", OneHotEncoder(), [1]),
        ]
    )
    preproc.fit_transform(df[["email", "country"]].to_numpy())
    

    Output:

    array([[8., 9., 4., 0., 0., 1.],
           [8., 9., 3., 1., 0., 0.],
           [8., 9., 0., 0., 1., 0.]])
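
    The ColumnTransformer then simply replaces the FeatureUnion step in your Pipeline. A minimal sketch, assuming the DataFrame-input preproc above and the LGBMClassifier from your question:

    from lightgbm import LGBMClassifier
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("preproc", preproc),
        ("classifier", LGBMClassifier()),
    ])
    pipe.fit(df[["email", "country"]], df["label"])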
    

    As an aside, one-hot encoding would cause more harm than good if country has high cardinality.
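
    If it does, one option is to cap or group the rare categories rather than dropping one-hot encoding entirely. A minimal sketch, assuming scikit-learn >= 1.1 (which added min_frequency and max_categories to OneHotEncoder); the thresholds are just placeholders:

    from sklearn.preprocessing import OneHotEncoder

    # lump countries seen fewer than 10 times into one "infrequent" bucket,
    # keep at most 20 categories, and send unseen countries to that bucket
    ohe = OneHotEncoder(
        min_frequency=10,
        max_categories=20,
        handle_unknown="infrequent_if_exist",
    )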

    I've also tried to vectorize the preprocessing functions by using .str accessor methods instead of list comprehensions.