python · scikit-learn · cross-entropy

log_loss in sklearn: Multioutput target data is not supported with label binarization


The following code

from sklearn import metrics
import numpy as np
y_true = np.array([[0.2,0.8,0],[0.9,0.05,0.05]])
y_predict = np.array([[0.5,0.5,0.0],[0.5,0.4,0.1]])
metrics.log_loss(y_true, y_predict)

produces the following error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-24beeb19448b> in <module>()
----> 1 metrics.log_loss(y_true, y_predict)

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   1646         lb.fit(labels)
   1647     else:
-> 1648         lb.fit(y_true)
   1649 
   1650     if len(lb.classes_) == 1:

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\preprocessing\label.py in fit(self, y)
    276         self.y_type_ = type_of_target(y)
    277         if 'multioutput' in self.y_type_:
--> 278             raise ValueError("Multioutput target data is not supported with "
    279                              "label binarization")
    280         if _num_samples(y) == 0:

ValueError: Multioutput target data is not supported with label binarization

I am curious why. I have re-read the definition of log loss and cannot find anything that would make this computation invalid.


Solution

  • The source code indicates that metrics.log_loss does not support probabilities in y_true. It only supports binary indicator matrices of shape (n_samples, n_classes), for example [[0,0,1],[1,0,0]], or class labels of shape (n_samples,), for example [2, 0]. In the latter case, the class labels are one-hot encoded into an indicator matrix before the log loss is calculated.
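
    For contrast, here is a quick sketch of the two input forms that do work. The labels argument in the second call lists every class explicitly, since that y_true never mentions class 2 (outputs below are approximate):

    >>> from sklearn import metrics
    >>> import numpy as np
    >>> y_predict = np.array([[0.5, 0.5, 0.0], [0.5, 0.4, 0.1]])
    >>> # 1) binary indicator matrix of shape (n_samples, n_classes)
    >>> metrics.log_loss(np.array([[0, 1, 0], [1, 0, 0]]), y_predict)
    0.6931...
    >>> # 2) class labels of shape (n_samples,)
    >>> metrics.log_loss(np.array([1, 0]), y_predict, labels=[0, 1, 2])
    0.6931...

    Both calls come out to about ln 2 ≈ 0.693, because each sample assigns probability 0.5 to its true class.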

    In this block:

    lb = LabelBinarizer()
    
    if labels is not None:
        lb.fit(labels)
    else:
        lb.fit(y_true)
    

    You are reaching lb.fit(y_true), which fails whenever y_true holds anything other than 0/1 indicators or plain class labels. For example:

    >>> import numpy as np
    >>> from sklearn import preprocessing
    
    >>> lb = preprocessing.LabelBinarizer()
    
    >>> lb.fit(np.array([[0,1,0],[1,0,0]]))
    
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
    
    >>> lb.fit(np.array([[0.2,0.8,0],[0.9,0.05,0.05]]))
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/imran/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 278, in fit
        raise ValueError("Multioutput target data is not supported with "
    ValueError: Multioutput target data is not supported with label binarization
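
    The check that trips here is type_of_target, which classifies a float matrix as a continuous multioutput target rather than a label indicator; the 'multioutput' substring in its answer is exactly what the raise in label.py keys on. A quick probe on the same data:

    >>> from sklearn.utils.multiclass import type_of_target
    >>> type_of_target(np.array([[0, 1, 0], [1, 0, 0]]))
    'multilabel-indicator'
    >>> type_of_target(np.array([[0.2, 0.8, 0], [0.9, 0.05, 0.05]]))
    'continuous-multioutput'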
    

    Instead, I would define your own log loss function that accepts probabilities in y_true:

    def logloss(y_true, y_pred, eps=1e-15):
        # Clip predictions away from 0 and 1 so np.log never sees 0
        y_pred = np.clip(y_pred, eps, 1 - eps)
        # Cross-entropy of each sample, averaged over the samples
        return -(y_true * np.log(y_pred)).sum(axis=1).mean()
    

    Here is the output on your data:

    >>> logloss(y_true, y_predict)
    0.738961717153653
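
    As a rough sanity check against the definition: the first sample contributes -(0.2·ln 0.5 + 0.8·ln 0.5) ≈ 0.6931 (the zero-probability entry drops out) and the second -(0.9·ln 0.5 + 0.05·ln 0.4 + 0.05·ln 0.1) ≈ 0.7848; their mean is ≈ 0.7390, which matches the output above. One caveat: this helper assumes each row of y_pred already sums to 1, whereas sklearn's log_loss renormalizes the predicted probabilities before taking the log.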