scikit-learn cross-validation over-fitting or under-fitting

I'm using scikit-learn cross_validation and get, for example, a mean score of 0.82 (r2_scorer). How can I tell whether I am over-fitting or under-fitting using scikit-learn functions?


  • Unfortunately, the cross_val_score tool only reports test scores, so on its own it cannot be used to compare train and test scores in a CV setup.

    You can set up your own loop with the train_test_split function, as in Ando's answer, but you can also use any other CV scheme, for example KFold:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import get_scorer
    # regressor, X and y are assumed to be defined already (X, y as NumPy arrays)
    scorer = get_scorer('r2')
    cv = KFold(n_splits=5)
    train_scores, test_scores = [], []
    for train, test in cv.split(X):
        # refit on the training fold, then score on both folds
        regressor.fit(X[train], y[train])
        train_scores.append(scorer(regressor, X[train], y[train]))
        test_scores.append(scorer(regressor, X[test], y[test]))
    mean_train_score = np.mean(train_scores)
    mean_test_score = np.mean(test_scores)
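
As a possible shortcut, recent scikit-learn releases provide cross_validate, which can return train scores alongside test scores when return_train_score=True is passed. The sketch below assumes such a release is installed; the synthetic data from make_regression and the Ridge regressor are placeholders, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Placeholder data and model (assumptions for illustration only).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
regressor = Ridge(alpha=1.0)

cv_results = cross_validate(
    regressor, X, y,
    cv=KFold(n_splits=5),
    scoring="r2",
    return_train_score=True,  # also score each fitted model on its training fold
)

# One score per fold; average them as in the manual loop.
mean_train_score = np.mean(cv_results["train_score"])
mean_test_score = np.mean(cv_results["test_score"])
print(f"mean train r2: {mean_train_score:.3f}, mean test r2: {mean_test_score:.3f}")
```

The manual loop is still useful when you need something cross_validate does not expose, such as inspecting each fitted model between folds.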

    If you compute the mean train and test scores with cross validation you can then find out if you are:

    • Underfitting: the train score is far from the perfect score (which is 1.0 for r2)
    • Overfitting: the train and test scores are not close to one another (the mean test score is significantly lower than the mean train score).

    Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
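
The two rules above can be turned into a rough check. This is only a sketch: the function name diagnose_fit and both thresholds (underfit_gap, overfit_gap) are hypothetical choices, not scikit-learn features, and sensible values depend on the problem and on how noisy the data is.

```python
def diagnose_fit(mean_train_score, mean_test_score,
                 perfect_score=1.0, underfit_gap=0.1, overfit_gap=0.1):
    """Apply the two rules to mean CV scores. Thresholds are arbitrary assumptions."""
    findings = []
    # Underfitting: the train score is far from the perfect score.
    if perfect_score - mean_train_score > underfit_gap:
        findings.append("underfitting")
    # Overfitting: the test score is well below the train score.
    if mean_train_score - mean_test_score > overfit_gap:
        findings.append("overfitting")
    return findings or ["no clear problem"]


print(diagnose_fit(0.98, 0.82))  # large train/test gap
print(diagnose_fit(0.60, 0.58))  # low train score, small gap
print(diagnose_fit(0.70, 0.40))  # both at once
```

With the 0.82 mean test score from the question, the verdict depends entirely on the mean train score, which is why a test score alone cannot answer it.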