python machine-learning scikit-learn nlp

XGBClassifier ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)


I'm working on a text classification project.

While exploring different classifiers, I came across XGBClassifier.

My classification task is multiclass. I get the error above when I try to score the classifier. I'm guessing some reshaping is needed, but I don't understand why. What's strange to me is that the other classifiers work just fine, and so does XGBClassifier with its default params.

Here's the relevant section from my code:

algorithms = [    
    svm.LinearSVC(),  # <<<=== Works    
    linear_model.RidgeClassifier(), # <<<=== Works    
    XGBClassifier(),  # <<<=== Works    
    XGBClassifier(objective='multi:softprob', num_class=len(groups_count_dict), eval_metric='merror')  # <<<=== Not working
]

def train(algorithm, X_train, y_train):
    model = Pipeline([       
        ('vect', transformer),
        ('classifier', OneVsRestClassifier(algorithm))
    ])
    model.fit(X_train, y_train)

    return model

score_dict = {}
algorithm_to_model_dict = {}
for algorithm in algorithms:
    print()
    print(f'trying {algorithm}')
    model = train(algorithm, X_train, y_train)
    score = model.score(X_test, y_test)
    score_dict[algorithm] = int(score * 100)
    algorithm_to_model_dict[algorithm] = model
    
sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
    print(f'{classifier.__class__.__name__}: score is {score}%')

Here's the error again:

ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)

I'm not sure it's related, but I'll mention it anyway: my transformer is created like this:

tuples = []
tfidf_kwargs = {'ngram_range': (1, 2), 'stop_words': 'english', 'sublinear_tf': True}
for col in list(features.columns):
    tuples.append((f'vec_{col}', TfidfVectorizer(**tfidf_kwargs), col))

transformer = ColumnTransformer(tuples, remainder='passthrough')

Thanks in advance

EDIT:

Adding the full trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-576cd62f3df0> in <module>
     84     print(f'trying {algorithm}')
     85     model = train(algorithm, X_train, y_train)
---> 86     score = model.score(X_test, y_test)
     87     score_dict[algorithm] = int(score * 100)
     88     algorithm_to_model_dict[algorithm] = model

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
    620         if sample_weight is not None:
    621             score_params['sample_weight'] = sample_weight
--> 622         return self.steps[-1][-1].score(Xt, y, **score_params)
    623 
    624     @property

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    498         """
    499         from .metrics import accuracy_score
--> 500         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    501 
    502     def _more_tags(self):

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/multiclass.py in predict(self, X)
    365             for i, e in enumerate(self.estimators_):
    366                 pred = _predict_binary(e, X)
--> 367                 np.maximum(maxima, pred, out=maxima)
    368                 argmaxima[maxima == pred] = i
    369             return self.classes_[argmaxima]

ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,) 

Printing the shapes of X_test and y_test yields: (2557, 12) (2557,)

I was able to work out where the (8,) comes from: it's the length of groups_count_dict, i.e. the value passed as num_class.
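To make the shapes in the error concrete, here is a minimal NumPy-only sketch of the running-maximum loop that appears in the traceback (sklearn/multiclass.py, predict). The helper name combine_scores and the random scores are made up for illustration; it is an approximation of that loop, not scikit-learn's actual code. It shows that the loop expects one score per sample from each per-class estimator, and fails with this same broadcast error as soon as an estimator returns an array of length 8 instead of 2557.

```python
import numpy as np

n_samples, n_classes = 2557, 8  # shapes taken from the error message


def combine_scores(scores_per_estimator, n_samples):
    """Approximate the loop from the traceback: keep a running maximum
    over the per-class scores and remember which estimator produced it."""
    maxima = np.full(n_samples, -np.inf)
    argmaxima = np.zeros(n_samples, dtype=int)
    for i, pred in enumerate(scores_per_estimator):
        # pred must have shape (n_samples,) for this in-place update to work
        np.maximum(maxima, pred, out=maxima)
        argmaxima[maxima == pred] = i
    return argmaxima


rng = np.random.default_rng(0)

# Expected case: every per-class estimator returns one score per sample.
good_scores = [rng.random(n_samples) for _ in range(n_classes)]
winners = combine_scores(good_scores, n_samples)

# Failing case: an estimator returns an array of length n_classes instead.
bad_scores = [rng.random(n_classes)]
try:
    combine_scores(bad_scores, n_samples)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

So the (2557,) entries in the message are the per-sample maxima array (as input and as out=), and the (8,) entry is what the wrapped estimator returned.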


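Solution

  • It turns out the solution was to remove OneVsRestClassifier from the pipeline and pass the classifier to the pipeline directly. XGBClassifier handles multiclass targets natively, so the wrapper isn't needed for it.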
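As a rough sketch of that change, the train function from the question could take the transformer explicitly and wrap the classifier only when asked to. The transformer argument and the wrap_one_vs_rest flag are additions for illustration and are not part of the original code; this is one possible way to arrange it, not necessarily what the final code looked like.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline


def train(algorithm, transformer, X_train, y_train, wrap_one_vs_rest=False):
    """Fit a two-step pipeline (vectorizer, classifier).

    wrap_one_vs_rest is a hypothetical flag: when False (the default),
    the classifier goes into the pipeline as-is, which is the fix
    described above.
    """
    classifier = OneVsRestClassifier(algorithm) if wrap_one_vs_rest else algorithm
    model = Pipeline([
        ('vect', transformer),
        ('classifier', classifier),
    ])
    model.fit(X_train, y_train)
    return model
```

With this version, the XGBClassifier(objective='multi:softprob', ...) entry from the algorithms list would be passed with wrap_one_vs_rest left at False, while the flag could still be set to True for classifiers that should keep the one-vs-rest wrapper.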