Tags: scikit-learn, k-fold

Does K-Fold iteratively train a model?


If you run cross_val_score() or cross_validate() on a dataset, is the estimator trained on all the folds by the end of the run?

I read somewhere that cross_val_score takes a copy of the estimator, whereas I thought this was how you train a model using k-fold.

Or, at the end of cross_validate() or cross_val_score(), do you end up with a single estimator that you then use for predict()?

Is my thinking correct?


Solution

  • You can refer to the scikit-learn documentation here.

    If you do 3-fold cross-validation,

    • scikit-learn will split your dataset into 3 parts (for example, the 1st part contains the 1st-3rd rows, the 2nd part contains the 4th-6th rows, and so on)
    • scikit-learn then trains a new model 3 times, each time with a different training set and validation set:
      • In the first round, it combines the 1st and 2nd parts into the training set and tests the model on the 3rd part.
      • In the second round, it combines the 1st and 3rd parts into the training set and tests the model on the 2nd part.
      • and so on.
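The splitting above can be sketched directly with KFold (the 9-row array here is just an illustration; without shuffling, KFold assigns contiguous blocks of rows to each fold, as the description assumes):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # 9 rows -> three parts of 3 rows each
kf = KFold(n_splits=3)

# Each iteration yields the row indices of the training and test parts
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
# First round tests on rows 0-2 and trains on rows 3-8, and so on.
```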

    So, after using cross_validate, you will get three models. If you want the model object from each round, pass the parameter return_estimator=True. The returned dictionary will then have an extra key named estimator containing the list of fitted estimators, one per training round.

    from sklearn import datasets, linear_model
    from sklearn.model_selection import cross_validate
    diabetes = datasets.load_diabetes()
    X = diabetes.data[:150]
    y = diabetes.target[:150]
    lasso = linear_model.Lasso()
    cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)
    print(sorted(cv_results.keys()))
    #Output: ['estimator', 'fit_time', 'score_time', 'test_score']
    cv_results['estimator']
    #Output: [Lasso(), Lasso(), Lasso()]
    

    However, in practice, cross-validation is used only for evaluating the model. Once you have found a good model and parameter setting that gives a high cross-validation score, it is better to refit the model on the whole training set and then test it on a held-out test set.
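    A minimal sketch of that workflow, reusing the diabetes data and Lasso from above (the specific split and parameters are illustrative assumptions, not part of the original answer):

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split, cross_val_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Hold out a test set before doing any cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = linear_model.Lasso()

# Use CV scores on the training set only, to judge model/parameter choices
scores = cross_val_score(lasso, X_train, y_train, cv=3)
print("mean CV score:", scores.mean())

# Refit on the full training set, then evaluate once on the test set
lasso.fit(X_train, y_train)
print("test score:", lasso.score(X_test, y_test))
```

    Note that cross_val_score fits copies internally, so the final fit on X_train is what actually populates lasso for later predict() calls.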