I'm working on a text classification project.
While exploring different classifiers I came across XGBClassifier
My classification task is multi-class. I'm getting the above-mentioned error when trying to score the classifier - I'm guessing some reshaping is needed, but I fail to understand why. What's strange to me is that other classifiers work just fine (even this one with its default params).
Here's the relevant section from my code:
algorithms = [
    svm.LinearSVC(),  # <<<=== Works
    linear_model.RidgeClassifier(),  # <<<=== Works
    XGBClassifier(),  # <<<=== Works
    XGBClassifier(objective='multi:softprob', num_class=len(groups_count_dict), eval_metric='merror')  # <<<=== Not working
]
def train(algorithm, X_train, y_train):
    model = Pipeline([
        ('vect', transformer),
        ('classifier', OneVsRestClassifier(algorithm))
    ])
    model.fit(X_train, y_train)
    return model
score_dict = {}
algorithm_to_model_dict = {}

for algorithm in algorithms:
    print()
    print(f'trying {algorithm}')
    model = train(algorithm, X_train, y_train)
    score = model.score(X_test, y_test)
    score_dict[algorithm] = int(score * 100)
    algorithm_to_model_dict[algorithm] = model

sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
    print(f'{classifier.__class__.__name__}: score is {score}%')
Here's the error again:
ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)
Not sure it's related, but I'll mention it anyway - my transformer is being created as such:
tuples = []
tfidf_kwargs = {'ngram_range': (1, 2), 'stop_words': 'english', 'sublinear_tf': True}
for col in list(features.columns):
    tuples.append((f'vec_{col}', TfidfVectorizer(**tfidf_kwargs), col))
transformer = ColumnTransformer(tuples, remainder='passthrough')
Thanks in advance
EDIT:
Adding the full trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-576cd62f3df0> in <module>
84 print(f'trying {algorithm}')
85 model = train(algorithm, X_train, y_train)
---> 86 score = model.score(X_test, y_test)
87 score_dict[algorithm] = int(score * 100)
88 algorithm_to_model_dict[algorithm] = model
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
118
119 # lambda, but not partial, allows help() to work with update_wrapper
--> 120 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
121 # update the docstring of the returned function
122 update_wrapper(out, self.fn)
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
620 if sample_weight is not None:
621 score_params['sample_weight'] = sample_weight
--> 622 return self.steps[-1][-1].score(Xt, y, **score_params)
623
624 @property
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
498 """
499 from .metrics import accuracy_score
--> 500 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
501
502 def _more_tags(self):
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/multiclass.py in predict(self, X)
365 for i, e in enumerate(self.estimators_):
366 pred = _predict_binary(e, X)
--> 367 np.maximum(maxima, pred, out=maxima)
368 argmaxima[maxima == pred] = i
369 return self.classes_[argmaxima]
ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)
Printing the shapes of X_test and y_test yields: (2557, 12) (2557,)
I was able to understand where the (8,) comes from - it's the length of groups_count_dict.
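The shape mismatch can be reproduced in isolation: OneVsRestClassifier.predict keeps a running per-sample maximum of shape (n_samples,), while the softprob-configured inner estimator apparently hands back one value per class, shape (num_class,). A minimal sketch of the failing np.maximum call, using the shapes from the traceback (2557 samples, 8 classes):

```python
import numpy as np

n_samples, num_class = 2557, 8

maxima = np.zeros(n_samples)  # running per-sample maximum, shape (2557,)
pred = np.zeros(num_class)    # per-class output from the inner estimator, shape (8,)

try:
    # This mirrors the failing line inside OneVsRestClassifier.predict
    np.maximum(maxima, pred, out=maxima)
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes (2557,) (8,) (2557,)
```

NumPy can't broadcast (2557,) against (8,) since the trailing dimensions are neither equal nor 1, hence the ValueError.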
Turns out the solution was to remove the OneVsRestClassifier usage from the pipeline.
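For reference, here is a sketch of the working pipeline shape - the estimator goes into the pipeline directly, since XGBClassifier (like most scikit-learn classifiers) handles multiple classes natively and doesn't need the one-vs-rest wrapper. LogisticRegression and a toy text dataset are used as stand-ins here so the snippet runs without xgboost or my real data; the structure is the same:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy 3-class text data - stand-in for the real X_train/y_train
X_train = ['red apple pie', 'green apple tart', 'blue sky today',
           'clear blue sea', 'fast red car', 'slow green truck']
y_train = ['fruit', 'fruit', 'nature', 'nature', 'vehicle', 'vehicle']

def train(algorithm, X_train, y_train):
    # No OneVsRestClassifier wrapper - the estimator handles multiclass itself
    model = Pipeline([
        ('vect', TfidfVectorizer()),
        ('classifier', algorithm),
    ])
    model.fit(X_train, y_train)
    return model

model = train(LogisticRegression(max_iter=1000), X_train, y_train)
print(model.score(X_train, y_train))
```

With the wrapper gone, model.score works because the final estimator's predict returns one label per sample, shape (n_samples,), exactly what accuracy_score expects.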