Tags: python, scikit-learn, pipeline

ColumnTransformer fails with CountVectorizer/HashingVectorizer in a pipeline (multiple text features)


Similar to this problem (ColumnTransformer fails with CountVectorizer in a pipeline), I want to apply CountVectorizer/HashingVectorizer to text columns using a ColumnTransformer in a pipeline. But I don't have just one text feature; I have several. Passing a single feature (as a plain string rather than a list, as suggested in the answer to that question) works fine. How do I do it for multiple text features?

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

numeric_features = ['x0', 'x1', 'y0', 'y1']
categorical_features = []
text_features = ['text_feature', 'another_text_feature']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('encoder', OneHotEncoder())])
text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features), 
    ('categorical', categorical_transformer, categorical_features),
    ('text', text_transformer, text_features)
])
    
steps = [('preprocessor', preprocessor),
         ('clf', SGDClassifier())]
    
pipeline = Pipeline(steps=steps)
    
pipeline.fit(X_train, y_train)

Solution

  • Just use a separate transformer entry for each text feature, passing each column name as a string rather than a list. CountVectorizer/HashingVectorizer expect a 1-D sequence of documents, so each vectorizer must receive a single column.

    preprocessor = ColumnTransformer(transformers=[
        ('numeric', numeric_transformer, numeric_features), 
        ('categorical', categorical_transformer, categorical_features),
        ('text', text_transformer, 'text_feature'),
        ('more_text', text_transformer, 'another_text_feature'),
    ])
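
    If the list of text columns is long, the per-column entries can be generated in a loop instead of written out by hand. A minimal sketch, reusing the variable names from the question (the `text_{col}` naming scheme is just an illustrative choice):

    ```python
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_extraction.text import HashingVectorizer

    numeric_features = ['x0', 'x1', 'y0', 'y1']
    text_features = ['text_feature', 'another_text_feature']

    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

    # One (name, transformer, column) triple per text column; each column
    # is a plain string so the vectorizer gets a 1-D array of documents.
    preprocessor = ColumnTransformer(transformers=[
        ('numeric', numeric_transformer, numeric_features),
        *[(f'text_{col}', text_transformer, col) for col in text_features],
    ])
    ```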
    

    (The transformers get cloned during fitting, so you end up with two independent copies of text_transformer and everything works as expected. If specifying the same transformer object twice bothers you, you can clone it manually with sklearn.base.clone before building the ColumnTransformer.)
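
    An end-to-end sketch of the fix, using a small synthetic DataFrame (the data and labels here are invented purely for illustration):

    ```python
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # Toy data: two numeric columns and two text columns.
    X = pd.DataFrame({
        'x0': [0.1, 0.5, 0.9, 0.3],
        'x1': [1.0, 2.0, 3.0, 4.0],
        'text_feature': ['red apple', 'green pear', 'red cherry', 'green grape'],
        'another_text_feature': ['sweet', 'sour', 'sweet', 'sour'],
    })
    y = [0, 1, 0, 1]

    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

    # One entry per text column, each passed as a plain string so the
    # vectorizer receives a 1-D array of documents, not a 2-D frame.
    preprocessor = ColumnTransformer(transformers=[
        ('numeric', numeric_transformer, ['x0', 'x1']),
        ('text', text_transformer, 'text_feature'),
        ('more_text', text_transformer, 'another_text_feature'),
    ])

    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('clf', SGDClassifier())])
    pipeline.fit(X, y)
    preds = pipeline.predict(X)
    ```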