Tags: scikit-learn, feature-extraction, gridsearchcv, countvectorizer, scikit-learn-pipeline

Incompatible row dimensions when using passthrough in GridSearch over sklearn Pipeline with FeatureUnion


I am trying to run a grid search over a sklearn pipeline whose FeatureUnion contains a custom transformer. It works fine when the custom transformer class is used in the FeatureUnion; however, it fails when the grid search parameters set that step to passthrough in order to skip it.

The full pipeline is defined as follows:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

ngram_vectorizer = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1,3))), 
    ("tfidf", TfidfTransformer())
])

pipe_full = Pipeline(
    [
        ("features", FeatureUnion(
            [
                ("ngrams", ngram_vectorizer),
                ("lengths", TextLengthExtractor())
            ]
            )
        ),
        ("classifier", MultinomialNB())
    ]
)

The custom transformer class TextLengthExtractor simply computes the number of characters of each input string (it has to be defined before the pipeline above is built):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X, y=None):
        # One row per document, one column holding its character count
        string_lengths = np.array([len(doc) for doc in X])
        return string_lengths.reshape(-1, 1)

The tuning parameters for the grid search are defined in a dictionary params. Importantly, the candidate values for the custom TextLengthExtractor step include the passthrough option, which is meant to skip the entire features__lengths step of the pipeline (see also sklearn's documentation on pipelines):

params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range" : [(1,3), (2,6)],
}

When the pipeline is fit on the following dummy data

X_train_dummy = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa", "c", "bca"]
y_train_dummy = [1,0,1, 1, 0, 1, 0, 1, 0, 0]
pipe_full.fit(X_train_dummy, y_train_dummy)

it can be seen that the lengths step of the FeatureUnion pipeline works as expected:

pipe_full["features"].get_params()["lengths"].transform(X_train_dummy)
# gives the following output of shape (10,1)
# array([[1], [2], [4], [5], [9], [2], [4], [8], [1], [3]])

However - and now comes the problem - when grid search is performed as follows:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)

all fits that skip the lengths step (as requested by the passthrough option in params["features__lengths"]) fail with the following error:

5 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\dev\NameClassification\venv\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1162, in fit_transform
    return self._hstack(Xs)
  File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1216, in _hstack
    Xs = sparse.hstack(Xs).tocsr()
  File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 532, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 665, in bmat
    raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 8.

I do understand that the ngrams and lengths blocks of the FeatureUnion must have identical row dimensions, i.e. the number of rows in each extracted feature matrix must equal the number of samples in the respective split. However, I have no idea how to control the shape of the matrices when skipping the lengths part of the FeatureUnion via the passthrough option in the grid search params.
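To make the row-dimension requirement concrete, here is a minimal sketch that computes each block separately for 8 documents (the size of one training split with cv=5) and compares their shapes. The variable names ngram_block, length_block and raw_block are only illustrative, and raw_block is my assumption about what an untransformed branch would hand over: the raw strings, which have no column axis at all.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return np.array([len(doc) for doc in X]).reshape(-1, 1)


# 8 documents, mimicking one training split of the 10 dummy samples
docs = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa"]

# Sparse n-gram counts: one row per document
ngram_block = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit_transform(docs)

# Dense length feature: one row per document, one column
length_block = TextLengthExtractor().fit_transform(docs)

# Assumption: an untransformed branch would contribute the raw strings,
# i.e. a 1-D sequence rather than an (n_samples, n_features) matrix
raw_block = np.asarray(docs)

print(ngram_block.shape[0], length_block.shape, raw_block.shape)
```

The first two blocks agree on the number of rows and can be stacked side by side; the 1-D block cannot, which would be consistent with the ValueError above.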

I have not found any solution to the problem on SE or in any other sklearn-related resource. Does anyone have an idea how to solve the issue?


Solution

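  • I think I found the solution to the problem: to skip an individual step in a FeatureUnion, the string drop rather than passthrough must be used. With passthrough the step is not removed; it forwards its input unchanged, so the raw strings are stacked next to the n-gram matrix, which presumably explains the row-dimension mismatch. According to sklearn's documentation of FeatureUnion: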

    Parameters of the transformers may be set using its name and the parameter name separated by a '__'. A transformer may be replaced entirely by setting the parameter with its name to another transformer, removed by setting to 'drop' or disabled by setting to 'passthrough' (features are passed without transformation).

    An example of dropping an entire transformer in FeatureUnion is also shown in sklearn's user guide on pipelines.

    In conclusion, to solve my problem, I had to replace passthrough with drop in the grid search parameter dictionary as follows

    Change from

    params = {
        "features__lengths": [TextLengthExtractor(), "passthrough"],
        "features__ngrams__vectorizer__ngram_range" : [(1,3), (2,6)],
    }
    

    to

    params = {
        "features__lengths": [TextLengthExtractor(), "drop"],
        "features__ngrams__vectorizer__ngram_range" : [(1,3), (2,6)],
    }
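For completeness, a self-contained sketch of the whole setup with the corrected parameter grid, assembled from the snippets in the question. Two things differ from the original call and are my own choices: error_score="raise" is set so that a failing fit raises instead of being scored as nan, and n_jobs and verbose are left at their defaults to keep the example small.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # One row per document, one column with its character count
        return np.array([len(doc) for doc in X]).reshape(-1, 1)


ngram_vectorizer = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
])

pipe_full = Pipeline([
    ("features", FeatureUnion([
        ("ngrams", ngram_vectorizer),
        ("lengths", TextLengthExtractor()),
    ])),
    ("classifier", MultinomialNB()),
])

# "drop" removes the lengths branch from the union for that candidate
params = {
    "features__lengths": [TextLengthExtractor(), "drop"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}

X_train_dummy = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa", "c", "bca"]
y_train_dummy = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

# Assumption: error_score="raise" is added here so failures are not hidden as nan
grid_search = GridSearchCV(pipe_full, params, cv=5, error_score="raise")
grid_search.fit(X_train_dummy, y_train_dummy)

# One mean score per parameter combination; none should be nan
mean_scores = grid_search.cv_results_["mean_test_score"]
print(len(mean_scores), np.isnan(mean_scores).any())
```

If every parameter combination fits, cv_results_["mean_test_score"] contains one finite value per candidate, including the candidates where the lengths branch is dropped.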