Similar to this question (ColumnTransformer fails with CountVectorizer in a pipeline), I want to apply a CountVectorizer/HashingVectorizer to text columns using a ColumnTransformer inside a pipeline. But I don't have just one text feature, I have several. If I pass a single feature (as a string, not as a list, as suggested in the answer to the other question), it works fine. How do I do this for multiple text features?
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['x0', 'x1', 'y0', 'y1']
categorical_features = []
text_features = ['text_feature', 'another_text_feature']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('encoder', OneHotEncoder())])
text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
    ('text', text_transformer, text_features)
])

steps = [('preprocessor', preprocessor),
         ('clf', SGDClassifier())]
pipeline = Pipeline(steps=steps)
pipeline.fit(X_train, y_train)
Just use a separate transformer entry for each text feature, passing each column name as a string (not inside a list), so the vectorizer receives a 1-D array of documents:
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
    ('text', text_transformer, 'text_feature'),
    ('more_text', text_transformer, 'another_text_feature'),
])
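Here is a minimal end-to-end sketch of that approach on toy data of my own (column names match the question, but the values and hyperparameters are illustrative only; I've dropped the empty categorical branch for brevity):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with two numeric and two text columns.
X = pd.DataFrame({
    'x0': [0.1, 0.5, 0.9, 0.3],
    'x1': [1.0, 2.0, 3.0, 4.0],
    'text_feature': ['red apple', 'green pear', 'red cherry', 'green grape'],
    'another_text_feature': ['sweet fruit', 'sour fruit', 'sweet fruit', 'sour fruit'],
})
y = [0, 1, 0, 1]

preprocessor = ColumnTransformer(transformers=[
    ('numeric', Pipeline([('scaler', StandardScaler())]), ['x0', 'x1']),
    # One entry per text column; the column is given as a string, so each
    # HashingVectorizer gets the 1-D array of documents it expects.
    ('text', Pipeline([('hashing', HashingVectorizer(n_features=2**8))]),
     'text_feature'),
    ('more_text', Pipeline([('hashing', HashingVectorizer(n_features=2**8))]),
     'another_text_feature'),
])

pipeline = Pipeline([('preprocessor', preprocessor),
                     ('clf', SGDClassifier(random_state=0))])
pipeline.fit(X, y)
print(pipeline.predict(X))  # one label per row
```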
(The transformers get cloned during fitting, so you end up with two separate fitted copies of text_transformer and everything is fine. If specifying the same transformer object twice worries you, you can always copy/clone it manually before building the ColumnTransformer.)
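A sketch of that manual clone, using sklearn.base.clone (which returns an unfitted copy with the same parameters):

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline

text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

preprocessor = ColumnTransformer(transformers=[
    ('text', text_transformer, 'text_feature'),
    # clone() gives a distinct, unfitted copy, so the two entries
    # share no object state even before fitting.
    ('more_text', clone(text_transformer), 'another_text_feature'),
])
```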