python scikit-learn pipeline grid-search

TypeError: If no scoring is specified, the estimator passed should have a 'score' method, when using CountVectorizer in a GridSearch

I'm practicing with some text using scikit-learn.

Towards getting more familiar with GridSearch, I am starting with some example code found here:

###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer())
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0)
}
grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)

Notice I am being very careful here, and I've only got one estimator and one parameter!

I'm finding that when I run this, I get the error:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None))]) does not.

Hummmm...why am I missing some sort of 'score' attribute?

When I check the possible parameters,

print(CountVectorizer().get_params().keys())

I don't see anything where I can score, as was implied by this answer.

The documentation says: "By default, parameter search uses the score function of the estimator to evaluate a parameter setting." So why do I need to specify a score method?

Regardless, I thought I might need to explicitly pass a scoring argument, but this didn't help and gave me another error: grid_search.fit(X_train, y_train, scoring=None)

I don't understand this error!

Solution

GridSearch maximizes a score over the grid of parameters, so it needs to know which score to use, and many are possible. For classification problems, for example, you could use accuracy, F1 score, and so on. Usually the score type is specified by passing a string as the scoring argument (see scoring parameter). Note that scoring is an argument of the grid search constructor, not of fit, which is why passing it to fit fails. Alternatively, model classes such as SVC or RandomForestRegressor have a .score() method (mean accuracy for classifiers, R² for regressors), and GridSearch calls that if no scoring argument is provided. However, that may or may not be the score you want to optimize. You can also pass a function as the scoring argument if you have an unusual metric that you want GridSearch to use.
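To make the three options above concrete, here is a minimal sketch. The toy dataset from make_classification, the SVC grid over C, and the choice of F1 as the metric are all assumptions for illustration and are not from the question. The import path sklearn.model_selection assumes a reasonably recent scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy binary classification data, purely illustrative (not from the question).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0]}

# 1) No scoring argument: GridSearchCV falls back to SVC.score (mean accuracy).
default_search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

# 2) A string naming a built-in scorer, passed to the constructor (not to fit).
f1_search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=3).fit(X, y)

# 3) A callable scorer, here wrapped around a metric function with make_scorer.
custom_search = GridSearchCV(
    SVC(), param_grid, scoring=make_scorer(f1_score), cv=3
).fit(X, y)

print("accuracy:", default_search.best_score_)
print("f1 (string):", f1_search.best_score_)
print("f1 (callable):", custom_search.best_score_)
```

In all three cases the scoring choice is made when the search object is built, so fit only ever receives the data.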

Transformers, like CountVectorizer, do not implement a score method, because they are just deterministic feature transformations. For the same reason, there aren't any scoring methods that make sense to apply to that type of object. You need a model class (or possibly a clustering algorithm) at the end of your pipeline for scoring to make sense.
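As a rough sketch of what that looks like, the pipeline below keeps the same 'vect' step and 'vect__max_df' grid from the question and adds a classifier as the final step. The tiny corpus, the labels, the 'clf' step name, and the choice of MultinomialNB are all made up for illustration; any estimator with a score method should work in its place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (0 = sports, 1 = tech), for illustration only.
docs = [
    "the team won the match",
    "the player scored a goal",
    "the coach praised the team",
    "a late goal decided the match",
    "the team signed a new player",
    "the match ended in a draw",
    "the new phone has a fast chip",
    "the laptop runs the latest software",
    "a software update fixed the bug",
    "the chip powers the new laptop",
    "the phone received a software update",
    "the bug crashed the laptop",
]
labels = [0] * 6 + [1] * 6

# A transformer on its own has nothing to score.
print(hasattr(CountVectorizer(), "score"))  # False

# With a classifier as the last step, the pipeline exposes that step's score.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),  # hypothetical choice of final estimator
])

parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
}

grid_search = GridSearchCV(pipeline, parameters, cv=3)
grid_search.fit(docs, labels)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)
```

Because no scoring argument is given, the search here uses the classifier's default score (mean accuracy) through the pipeline.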