python scikit-learn nlp xgboost gridsearchcv

Error when trying to run a GridSearchCV on sklearn Pipeline

I'm trying to run a sklearn pipeline with TFIDF vectorizer and XGBoost Classifier through a GridSearchCV, but it doesn't work because of an internal error. The data is 4000 sentences, marked either true or false (1 or 0). This is the code:

import numpy as np
import pandas as pd

from gensim import utils
import gensim.parsing.preprocessing as gsp

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator

from sklearn.feature_extraction.text import TfidfVectorizer

import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
train_x = train.iloc[:, 0]
train_y = train.iloc[:, 1]

test_x = test.iloc[:, 0]
test_y = test.iloc[:, 1]

folds = 4

xgb_parameters = {
                'xgboost__n_estimators': [1000, 1500],
                'xgboost__max_depth': [12, 15],
                'xgboost__learning_rate': [0.1, 0.12],
                'xgboost__objective': ['binary:logistic']
}

model = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                         ('xgboost', xgb.XGBClassifier())])

gs_cv = GridSearchCV(estimator=model,
                     param_grid=xgb_parameters,
                     n_jobs=1,
                     refit=True,
                     cv=2,
                     scoring=f1_score)
gs_cv.fit(train_x, train_y)

But I am getting an error:

>>> gs_cv.fit(train_x, train_y)
C:\Users\draga\miniconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[21:31:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass labels=0       0
1       1
2       1
3       0
4       1
       ..
2004    0
2005    0
2008    0
2009    0
2012    0
Name: Bad Sentence, Length: 2000, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error       
  warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py:683: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 74, in inner_f
    return f(**kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1068, in f1_score
    return fbeta_score(y_true, y_pred, beta=1, labels=labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1192, in fbeta_score
    _, _, f, _ = precision_recall_fscore_support(y_true, y_pred,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1461, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1274, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 83, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 192, in _num_samples
    raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'>

What could be the problem?
Do I need to include the transform method for TfidfVectorizer() in the pipeline?

Solution

The main problem is your scoring parameter for the search. Scorers for hyperparameter tuners in sklearn need to have the signature (estimator, X, y). You can use the make_scorer convenience function, or in this case just pass the name as a string, scorer="f1".

See the docs, the list of builtins and information on signatures.

(You do not need to explicitly use the transform method; that's handled internally by the pipeline.)