python · machine-learning · scikit-learn · loss-function

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set


I am following Andrew Ng's instructions for evaluating a classification algorithm:

  1. Find the Loss Function of the Training Set.
  2. Compare it with the Loss Function of the Cross Validation.
  3. If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
  4. As a final confirmation, make predictions on the Test Set using the resulting Thetas (i.e., weights) produced in the previous step.
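In scikit-learn terms, my understanding of steps 1–2 is roughly the following (just a sketch: `LogisticRegression` is only a stand-in classifier, and the split sizes are arbitrary):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Hold out a validation set to play the role of the cross-validation set
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 1: loss on the training set; Step 2: loss on the validation set
train_loss = log_loss(y_train, model.predict_proba(X_train))
val_loss = log_loss(y_val, model.predict_proba(X_val))

# If both losses are small and close to each other, proceed to the test set;
# a large gap suggests variance, two large losses suggest bias.
print(train_loss, val_loss)
```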

I am trying to apply this using the Scikit-Learn library, but I am really lost and fairly sure I am doing it wrong (I couldn't find anything similar online):

from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():

    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model= svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))

    print(results)

Error

ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.

I am not even sure this is the right way to start. Any help is very much appreciated.


Solution

  • Given the clarifications you provided in the comments, and since you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and use accuracy instead:

    from sklearn import model_selection, svm
    from sklearn import datasets
    
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)  # shuffle=True is required when setting random_state
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring="accuracy")  # changed from log loss
    

    As already mentioned in the comments, including the log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).

    For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
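If you nevertheless want the log loss itself, one possible workaround (a sketch, not guaranteed across all scikit-learn versions) is to stratify the folds so every class appears in each fold and to use the built-in `"neg_log_loss"` scorer, which requires probability estimates from the classifier:

```python
from sklearn import datasets, model_selection, svm

iris = datasets.load_iris()
# StratifiedKFold with shuffling keeps all three classes in every fold;
# a plain unshuffled KFold on the class-sorted iris targets does not,
# which is what triggered the ValueError in the question.
kfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# probability=True is needed so the SVC exposes predict_proba,
# which the "neg_log_loss" scorer relies on (it adds an internal
# calibration step, so training is somewhat slower).
model = svm.SVC(kernel='linear', C=1, probability=True)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="neg_log_loss")
print(results.mean())  # negated log loss: closer to 0 is better
```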