I am following Andrew Ng's instructions for evaluating a classification algorithm, and I am trying to apply them using the scikit-learn library. However, I am really lost and fairly sure I am doing it wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not even sure this is the right way to start. Any help is very much appreciated.
Given the clarifications you provided in the comments, and since you are not particularly interested in log loss itself, I think the most straightforward approach is to abandon log loss and use accuracy instead:
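from sklearn import model_selection, svm
from sklearn import datasets

iris = datasets.load_iris()
# shuffle=True is needed for random_state to take effect; recent
# scikit-learn versions raise an error if it is set without shuffling
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model = svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
print(results)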
As already mentioned in the comments, including log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
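If you do want log loss at some point, here is a rough sketch of one possible workaround. It is not a definitive fix. It rests on two assumptions of mine that are not in your code: stratified, shuffled folds (`StratifiedKFold`) so that every test fold contains all three classes, and `probability=True` so that `SVC` exposes the `predict_proba` that log loss needs. The first part only illustrates why the original error appears: the iris labels are sorted by class, so unshuffled folds never contain all three of them.

```python
import numpy as np
from sklearn import datasets, model_selection, svm

iris = datasets.load_iris()

# iris.target is sorted by class, so no unshuffled test fold
# contains all three classes (most contain just one).
plain_kfold = model_selection.KFold(n_splits=10)
classes_per_fold = [len(np.unique(iris.target[test_idx]))
                    for _, test_idx in plain_kfold.split(iris.data)]
print(classes_per_fold)

# Assumption: stratified, shuffled folds keep every class in each test fold.
strat_kfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True,
                                              random_state=42)

# Assumption: probability=True enables predict_proba, which log loss requires.
model = svm.SVC(kernel='linear', C=1, probability=True)

# The built-in "neg_log_loss" scorer returns the negated log loss,
# so values closer to zero are better.
neg_log_loss = model_selection.cross_val_score(estimator=model,
                                               X=iris.data,
                                               y=iris.target,
                                               cv=strat_kfold,
                                               scoring="neg_log_loss")
print(neg_log_loss.mean())
```

Note that the built-in scorer string is used here instead of `make_scorer(log_loss, greater_is_better=False)`, because the latter, as written in the question, scores hard class predictions from `predict` rather than probabilities.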
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
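As a minimal sketch of how the per-fold accuracies might be summarised into a single estimate (the shuffled `KFold` and the mean/standard deviation summary are my own choices here, not something your setup requires):

```python
from sklearn import datasets, model_selection, svm

iris = datasets.load_iris()
# Shuffled folds, since the iris labels are sorted by class.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model = svm.SVC(kernel='linear', C=1)

# One accuracy value per fold.
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")

# The mean is the overall estimate; the standard deviation shows
# how much it varies from fold to fold.
print("accuracy: %.3f +/- %.3f" % (results.mean(), results.std()))
```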