I am trying to apply two different transformations from sklearn to two different columns, both of which are object dtype, inside my Pipeline. My DataFrame looks like this (I omit most rows just to illustrate my point):
email country label
fulanito@gmail.com NI True
fipretko@gmail.com AR False
trytryyy@gmail.com CZ True
Both email and country are object type.
For email I created a bunch of functions that transform it into numeric representations, like:
def email_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[0]) for e in email]).reshape(-1, 1)

def domain_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[-1]) for e in email]).reshape(-1, 1)

def number_of_vouls(email) -> np.ndarray:
    vouls = 'aeiouAEIOU'
    names = [e[0].split('@')[0] for e in email]
    return np.array([sum(1 for char in name if char in vouls) for name in names]).reshape(-1, 1)
To apply these functions to email in a sklearn Pipeline I was using FunctionTransformer and FeatureUnion like this:
get_email_length = FunctionTransformer(email_length)
get_domain_length = FunctionTransformer(domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = FeatureUnion([
    ('email_length', get_email_length),
    ('domain_length', get_domain_length),
    ('number_of_vouls', get_number_of_vouls)])

pipe = Pipeline([
    ('preproc', preproc),
    ('classifier', LGBMClassifier())
])
But I also want my Pipeline to apply a one-hot encoder to country. What would be the best way to do that given this Pipeline definition?
You could try ColumnTransformer:
def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df["email"]
        .str.split("@")
        .str[0]
        .str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )
get_email_length = FunctionTransformer(email_and_domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)
preproc = ColumnTransformer(
    [
        ("lengths", get_email_length, ["email"]),
        ("vouls", get_number_of_vouls, ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

preproc.fit_transform(df[["email", "country"]])
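To connect this back to the Pipeline in the question: the ColumnTransformer simply takes the place of the FeatureUnion as the first step. A minimal runnable sketch (LogisticRegression stands in for the LGBMClassifier from the question, since the classifier choice doesn't affect the wiring; the DataFrame is the three-row example above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    # Two columns: length of the local part and of the domain
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    # One column: vowel count in the local part
    return (
        df["email"]
        .str.split("@").str[0].str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )

preproc = ColumnTransformer(
    [
        ("lengths", FunctionTransformer(email_and_domain_length), ["email"]),
        ("vouls", FunctionTransformer(number_of_vouls), ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

# ColumnTransformer replaces FeatureUnion as the preprocessing step;
# LogisticRegression is a stand-in for any classifier, e.g. LGBMClassifier
pipe = Pipeline([
    ("preproc", preproc),
    ("classifier", LogisticRegression()),
])

df = pd.DataFrame({
    "email": ["fulanito@gmail.com", "fipretko@gmail.com", "trytryyy@gmail.com"],
    "country": ["NI", "AR", "CZ"],
})
y = [True, False, True]

pipe.fit(df[["email", "country"]], y)
print(pipe.predict(df[["email", "country"]]))
```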
Alternatively, just add a ColumnTransformer to the code in your question as-is — your original functions already work with ndarray input:
preproc = ColumnTransformer(
    [
        ("email_lengths", get_email_length, [0]),
        ("domain_lengths", get_domain_length, [0]),
        ("vouls", get_number_of_vouls, [0]),
        ("countries", OneHotEncoder(), [1]),
    ]
)

preproc.fit_transform(df[["email", "country"]].to_numpy())
Output:
array([[8., 9., 4., 0., 0., 1.],
       [8., 9., 3., 1., 0., 0.],
       [8., 9., 0., 0., 1., 0.]])
As an aside, one-hot encoding would cause more harm than good if country has high cardinality.
I've also tried to vectorize the preprocessing functions by using .str accessor methods instead of list comprehensions.