I am learning about sklearn custom transformers and read about the two core ways to create them: subclassing BaseEstimator and TransformerMixin, or using FunctionTransformer. I wanted to compare these two approaches by implementing a "meta-vectorizer": a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.
However, I can't get either of them to work when passing them to a sklearn.pipeline.Pipeline. I am getting the following error message in the fit_transform() step:
ValueError: all the input array dimensions for the concatenation axis must match
exactly, but along dimension 0, the array at index 0 has size 6 and the array
at index 1 has size 1
My code for option 1 (using a custom class):
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer':
                  [CountVectorizer(), TfidfVectorizer()]
              }
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
And my code for option 2 (creating a custom transformer from a function using FunctionTransformer):
def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()

vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__kw_args':
                  [{'vectorizer': CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]
              }
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
Imports and sample data:
import pandas as pd
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])
The issue is that both CountVectorizer and TfidfVectorizer require their input to be 1D (not 2D). For such cases, the ColumnTransformer documentation states that the columns entry of each tuple in transformers should be passed as a scalar string rather than as a list:
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
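To make the 1D versus 2D difference concrete, here is a small sketch (the three-row DataFrame is an illustrative subset of the sample data, not the full set). Selecting with a scalar gives a Series, which the vectorizer iterates row by row. Selecting with a list gives a DataFrame, and iterating a DataFrame yields its column names, so the vectorizer sees a single "document". That single row is where the "size 1" in the error message comes from.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative subset of the question's sample data.
df = pd.DataFrame({
    'Text': ['hi i love python very much',
             'should i learn java or python',
             'i want to be a programmer one day'],
})

# 1D selection (a Series): iterating yields one string per row.
X_1d = CountVectorizer().fit_transform(df['Text'])

# 2D selection (a DataFrame): iterating yields the column names,
# so the only "document" the vectorizer sees is the string 'Text'.
X_2d = CountVectorizer().fit_transform(df[['Text']])

print(X_1d.shape[0])  # number of rows produced from the Series
print(X_2d.shape)     # shape produced from the one-column DataFrame
```

ColumnTransformer then tries to stack this one-row block next to the one-hot block, which has one row per sample, and the concatenation fails.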
Therefore, the following will work in your case (i.e. changing ['Text'] into 'Text').
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), 'Text')], remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
              }
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
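One thing to watch in the class above: transform() calls fit_transform(), so the vocabulary is re-learned on every call, and the training and validation folds can end up with different numbers of columns. Below is a minimal sketch of a variant that learns the vocabulary once in fit() and reuses it in transform(). The name FittedVectorizer and the None default are my own choices for illustration, not part of the original code, and this version returns the sparse matrix as-is rather than calling .toarray().

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


class FittedVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn the vocabulary in fit(), reuse it in transform()."""

    def __init__(self, vectorizer=None):
        # Store the parameter untouched so get_params/set_params keep working.
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        base = CountVectorizer() if self.vectorizer is None else self.vectorizer
        # Fit a clone so the object passed in by the caller is never mutated.
        self.vectorizer_ = clone(base).fit(X)
        return self

    def transform(self, X):
        # Transform only: train and test data share one vocabulary.
        return self.vectorizer_.transform(X)


# Tiny made-up series standing in for a training and a validation fold.
train = pd.Series(['i love python', 'should i learn java or python'])
test = pd.Series(['python or rust'])

vec = FittedVectorizer(TfidfVectorizer()).fit(train)
n_train_features = vec.transform(train).shape[1]
n_test_features = vec.transform(test).shape[1]
print(n_train_features == n_test_features)
```

It should drop into the same ColumnTransformer slot with 'Text' as the column selector, and the vectorizer parameter can still be searched over with GridSearchCV.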
You can adjust the FunctionTransformer example accordingly. As a final remark, note that I passed handle_unknown='ignore' to OneHotEncoder so that categories which appear only in the validation fold of the cross-validation (and were not seen during training) do not raise an error.
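For completeness, here is a sketch of how the FunctionTransformer variant might look with the same change, assuming the vectorize_text function from the question and a three-row illustrative subset of the sample data. It only checks that the ColumnTransformer step runs. The function still refits the vectorizer on every call, so the caveat about differing column counts between folds applies to it as well.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def vectorize_text(X, vectorizer):
    # Refits the vectorizer on every call, as in the question's version.
    return vectorizer.fit_transform(X).toarray()


# Illustrative subset of the question's sample data.
df = pd.DataFrame({
    'Text': ['hi i love python very much',
             'should i learn java or python',
             'i want to be a programmer one day'],
    'Type': ['c', 'b', 'c'],
})

ct = ColumnTransformer([
    ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
    # Scalar 'Text' (not ['Text']) so the function receives a 1D Series.
    ('comment_text_vectorizer',
     FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()}),
     'Text'),
], remainder='drop')

features = ct.fit_transform(df)
print(features.shape[0])  # one row per sample
```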