python scikit-learn valueerror

Issue with scikit-learn's BaggingClassifier and custom base estimator: operands can't be broadcast together?


I'm trying to use a custom classifier with scikit-learn's BaggingClassifier, and I'm getting an error whose source I can't determine. My classifier object passes check_estimator(), and fit() runs without issue:

from sklearn import ensemble

model = ensemble.BaggingClassifier(customEstimator, max_samples=1/n_estimators, n_estimators=n_estimators)
model.fit(trainfeat, trainlabels)
model.predict(testfeat)

This yields the error trace below. The base estimator itself makes binary predictions by thresholding a sigmoid output. I know the shapes in the message must correspond to the test data, but I don't understand what the three operands are supposed to be. The error seems to come from BaggingClassifier, but the underlying problem must be in my code, no?

I'm trying to avoid pasting the code for my entire estimator, but it inherits from BaseEstimator and I only define the methods fit, predict, and predict_proba. Am I missing something in this regard?

I've tried reshaping the features and labels, to no avail; the error didn't even change. I also tried having my estimator inherit from ClassifierMixin, but that gave me a slew of new issues.

  File "Main_File.py", line 76, in <module>
    model.predict(testfeat)

  File "G:\Software\Anaconda\lib\site-packages\sklearn\multiclass.py", line 310, in predict
    indices.extend(np.where(_predict_binary(e, X) > thresh)[0])

  File "G:\Software\Anaconda\lib\site-packages\sklearn\multiclass.py", line 98, in _predict_binary
    score = estimator.predict_proba(X)[:, 1]

  File "G:\Software\Anaconda\lib\site-packages\sklearn\ensemble\bagging.py", line 698, in predict_proba
    for i in range(n_jobs))

  File "G:\Software\Anaconda\lib\site-packages\joblib\parallel.py", line 1003, in __call__
    if self.dispatch_one_batch(iterator):

  File "G:\Software\Anaconda\lib\site-packages\joblib\parallel.py", line 834, in dispatch_one_batch
    self._dispatch(tasks)

  File "G:\Software\Anaconda\lib\site-packages\joblib\parallel.py", line 753, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "G:\Software\Anaconda\lib\site-packages\joblib\_parallel_backends.py", line 201, in apply_async
    result = ImmediateResult(func)

  File "G:\Software\Anaconda\lib\site-packages\joblib\_parallel_backends.py", line 582, in __init__
    self.results = batch()

  File "G:\Software\Anaconda\lib\site-packages\joblib\parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]

  File "G:\Software\Anaconda\lib\site-packages\joblib\parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]

  File "G:\Software\Anaconda\lib\site-packages\sklearn\ensemble\bagging.py", line 129, in _parallel_predict_proba
    proba += proba_estimator

ValueError: operands could not be broadcast together with shapes (100000,2) (100000,) (100000,2)

Solution

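  • The problem most likely comes from the output of your customEstimator's predict_proba.

    Judging by the shapes in the error, your current implementation returns a 1-D array of shape (n_samples,), which cannot be broadcast against BaggingClassifier's (n_samples, 2) accumulator. The three shapes in the message are the accumulator, your estimator's output, and the in-place result of proba += proba_estimator. Make sure predict_proba returns an array of shape (n_samples, 2) for a binary classification problem, with one column per class.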
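To make the shape mismatch concrete, here is a minimal NumPy-only sketch that mimics the `proba += proba_estimator` line from the trace. It is not the asker's estimator: the array values and the helper name `as_two_columns` are assumptions made up for illustration, and it assumes the sigmoid output is the probability of the positive class.

```python
import numpy as np

n_samples = 5

# Stand-in for the accumulator in the bagging code: one column per class.
proba = np.zeros((n_samples, 2))

# What a sigmoid-based predict_proba might return if it only yields
# P(positive class): a 1-D array of shape (n_samples,). Values are made up.
p1 = np.array([0.1, 0.8, 0.4, 0.9, 0.3])

try:
    # (5, 2) += (5,) fails: the trailing dimensions 2 and 5 do not match.
    proba += p1
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print(exc)


def as_two_columns(p_positive):
    """Hypothetical helper: stack [P(class 0), P(class 1)] column-wise."""
    p_positive = np.asarray(p_positive, dtype=float).ravel()
    return np.column_stack([1.0 - p_positive, p_positive])


# With a (n_samples, 2) array the in-place addition works.
fixed = as_two_columns(p1)
proba += fixed
print(fixed.shape)
```

In a custom estimator, the same idea would apply to the return value of predict_proba: compute the positive-class probability, then return both columns so that each row sums to 1.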