Tags: python, machine-learning, scikit-learn, probability

"Too many indices for array" error in make_scorer function in Sklearn


Goal: use brier score loss to train a random forest algorithm using GridSearchCV

Issue: The predicted probabilities for the target "y" have the wrong number of dimensions when they are passed through a scorer built with make_scorer.

Following this question, I am using the proxy function suggested there so that GridSearchCV can score with the Brier score loss. Here is an example setup:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import brier_score_loss,make_scorer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs):
    return proxied_func(y_true, y_probs[:, class_idx], **kwargs)

brier_scorer = make_scorer(ProbaScoreProxy, greater_is_better=False,
                           needs_proba=True, class_idx=1, proxied_func=brier_score_loss)

X = np.random.randn(100,2)
y = (X[:,0]>0).astype(int)

random_forest = RandomForestClassifier(n_estimators=10)

random_forest.fit(X,y)

probs = random_forest.predict_proba(X)

Passing y and probs directly to ProbaScoreProxy (or passing the selected column to brier_score_loss) does not raise an error:

ProbaScoreProxy(y,probs,1,brier_score_loss)

outputs:

0.0006

Now pass it through brier_scorer:

brier_scorer(random_forest,X,y)

output:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-1474bb08e572> in <module>()
----> 1 brier_scorer(random_forest,X,y)

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, X, y_true, sample_weight)
    167                           stacklevel=2)
    168         return self._score(partial(_cached_call, None), estimator, X, y_true,
--> 169                            sample_weight=sample_weight)
    170 
    171     def _factory_args(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/_scorer.py in _score(self, method_caller, clf, X, y, sample_weight)
    258                                                  **self._kwargs)
    259         else:
--> 260             return self._sign * self._score_func(y, y_pred, **self._kwargs)
    261 
    262     def _factory_args(self):

<ipython-input-25-5321477444e1> in ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs)
      5 
      6 def ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs):
----> 7     return proxied_func(y_true, y_probs[:, class_idx], **kwargs)
      8 
      9 brier_scorer = make_scorer(ProbaScoreProxy, greater_is_better=False,                            needs_proba=True, class_idx=1, proxied_func=brier_score_loss)

IndexError: too many indices for array

So it seems like something is happening in make_scorer to change the dimension of its probability input, but I can't seem to see what the problem is.

Versions:
- sklearn: '0.22.2.post1'
- numpy: '1.18.1'

Note that y has the correct dimension here (1-d). If you experiment, you'll find that it's the dimension of the y_probs passed in to ProbaScoreProxy that causes the issue.
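The same error can be reproduced with NumPy alone, which is consistent with the scorer handing ProbaScoreProxy a 1-d array of positive-class probabilities instead of the 2-d predict_proba output. This is only a minimal sketch: select_class, probs_2d and probs_1d are hypothetical names and the probabilities are made up for illustration.

```python
import numpy as np

# Made-up probabilities shaped like predict_proba output: (n_samples, n_classes)
probs_2d = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

# The same data with the class-1 column already selected: shape (n_samples,)
probs_1d = probs_2d[:, 1]


def select_class(y_probs, class_idx):
    """Mimic the indexing step inside ProbaScoreProxy."""
    return y_probs[:, class_idx]


# Works on the 2-d array and yields one probability per sample
print(select_class(probs_2d, 1).shape)

# Fails on the 1-d array, because there is no second axis to index
try:
    select_class(probs_1d, 1)
    outcome = "no error"
except IndexError as exc:
    outcome = f"IndexError: {exc}"
print(outcome)
```

If the scorer really does pass a 1-d array for binary targets, the column selection in the proxy is applied to data that has already been reduced to one column.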

Is this just badly written code from that other question? Ultimately, what is the right way to build a make_scorer object that something like GridSearchCV will accept when tuning a random forest?


Solution

  • Goal: use brier score loss to train a random forest algorithm using GridSearchCV

    For this goal, you can pass the string 'neg_brier_score' directly as the scoring parameter of GridSearchCV.

    For example:

    gc = GridSearchCV(random_forest,
                      param_grid={"n_estimators":[5, 10]},
                      scoring="neg_brier_score")
    
    gc.fit(X, y)
    print(gc.scorer_) 
    # make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True)
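The snippet above can be expanded into a self-contained sketch that also reads the result back. Because 'neg_brier_score' is the negated loss (so that higher is better), best_score_ comes out negative and has to be flipped to recover the Brier score. The fixed seed, cv=5 and the variable names best_brier and train_brier are assumptions added for illustration, not part of the original setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV

# Same toy data as in the question, with a fixed seed (assumption) for repeatability
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)

gc = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10]},
    scoring="neg_brier_score",  # built-in scorer, no proxy function needed
    cv=5,
)
gc.fit(X, y)

# best_score_ is the mean cross-validated *negative* Brier score
best_brier = -gc.best_score_
print(gc.best_params_, best_brier)

# Manual check on the refit best estimator; this uses the training data,
# so it is an optimistic number and only illustrates the call
train_brier = brier_score_loss(y, gc.predict_proba(X)[:, 1])
print(train_brier)
```

Note that brier_score_loss itself expects the 1-d positive-class probabilities, which is why the manual check selects column 1 explicitly.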