Tags: python, scikit-learn, nlp, xgboost, gridsearchcv

Error when trying to run a GridSearchCV on sklearn Pipeline


I'm trying to run a sklearn Pipeline with a TfidfVectorizer and an XGBoost classifier through GridSearchCV, but scoring fails with an internal error. The data is 4000 sentences, each labelled either true or false (1 or 0). This is the code:

import numpy as np
import pandas as pd

from gensim import utils
import gensim.parsing.preprocessing as gsp

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator

from sklearn.feature_extraction.text import TfidfVectorizer

import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
train_x = train.iloc[:, 0]
train_y = train.iloc[:, 1]

test_x = test.iloc[:, 0]
test_y = test.iloc[:, 1]

folds = 4

xgb_parameters = {
                'xgboost__n_estimators': [1000, 1500],
                'xgboost__max_depth': [12, 15],
                'xgboost__learning_rate': [0.1, 0.12],
                'xgboost__objective': ['binary:logistic']
}

model = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                         ('xgboost', xgb.XGBClassifier())])

gs_cv = GridSearchCV(estimator=model,
                     param_grid=xgb_parameters,
                     n_jobs=1,
                     refit=True,
                     cv=2,
                     scoring=f1_score)
gs_cv.fit(train_x, train_y)

But I am getting an error:

>>> gs_cv.fit(train_x, train_y)
C:\Users\draga\miniconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[21:31:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass labels=0       0
1       1
2       1
3       0
4       1
       ..
2004    0
2005    0
2008    0
2009    0
2012    0
Name: Bad Sentence, Length: 2000, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error       
  warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py:683: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 74, in inner_f
    return f(**kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1068, in f1_score
    return fbeta_score(y_true, y_pred, beta=1, labels=labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1192, in fbeta_score
    _, _, f, _ = precision_recall_fscore_support(y_true, y_pred,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1461, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1274, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 83, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 192, in _num_samples
    raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'>

  1. What could be the problem?

  2. Do I need to include the transform method for TfidfVectorizer() in the pipeline?


Solution

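  • The main problem is the scoring parameter you pass to the search. Scorers for hyperparameter tuners in sklearn must have the signature (estimator, X, y), whereas f1_score is a plain metric with the signature (y_true, y_pred). The search therefore passes the fitted pipeline where f1_score expects y_true, which is why the traceback ends in "Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'>". You can wrap the metric with the make_scorer convenience function, or in this case just pass the name as a string: scoring="f1".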

    See the scikit-learn docs on the scoring parameter, the list of built-in scorer names, and the information on scorer signatures.

    (You do not need to call the transform method explicitly; the pipeline handles it internally, calling the vectorizer's fit_transform during fit and its transform during predict.)
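
    As a rough sketch of the two fixes, the snippet below runs a small grid search with scoring given first as a string and then as a make_scorer wrapper. The corpus, the "clf" step name, and the parameter grid are made up for illustration, and LogisticRegression is only a stand-in so the example runs without xgboost; with the pipeline from the question you would keep xgb.XGBClassifier() and the 'xgboost__' parameter names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the labelled sentences (assumption).
texts = ["good clean sentence", "bad broken sentence",
         "good tidy text", "bad messy text"] * 5
labels = [0, 1, 0, 1] * 5

# LogisticRegression is a stand-in classifier (assumption) so the sketch
# runs without xgboost; the question uses xgb.XGBClassifier() here.
model = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                        ("clf", LogisticRegression())])

# Hypothetical grid; parameter names are prefixed with the step name.
param_grid = {"clf__C": [0.1, 1.0]}

# Option 1: name a built-in scorer as a string.
gs_named = GridSearchCV(estimator=model, param_grid=param_grid,
                        cv=2, scoring="f1")
gs_named.fit(texts, labels)

# Option 2: wrap the metric so it is called as scorer(estimator, X, y).
gs_wrapped = GridSearchCV(estimator=model, param_grid=param_grid,
                          cv=2, scoring=make_scorer(f1_score))
gs_wrapped.fit(texts, labels)

print(gs_named.best_score_, gs_wrapped.best_score_)
```

    Both searches should report the same score, since the "f1" string and make_scorer(f1_score) describe the same binary F1 scorer. The make_scorer form becomes useful when the metric needs extra arguments, for example make_scorer(f1_score, average="macro").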