python, scikit-learn, random-forest, cross-validation, nearest-neighbor

cross_val_score behaves differently with different classifiers in sklearn


I'm having some difficulty with cross_val_score() in sklearn.

I have instantiated a KNeighborsClassifier with the following code:

clf = KNeighborsClassifier(n_neighbors=28)

I am then using cross validation to understand the accuracy of this classifier on my df of features (x) and target series (y) with the following:

cv_score_av = np.mean(cross_val_score(clf, x, y, cv=5))

I was hoping to get a different result each time I run the script. However, KNeighborsClassifier has no random_state parameter that I could set to None, as there is with RandomForestClassifier(), for example. Is there a way to get a different result with each run, or will I have to shuffle my data manually before running cross_val_score on my KNeighborsClassifier model?


Solution

  • There seems to be some misunderstanding here on your part; the random_state argument in the Random Forest refers to the algorithm itself, and not to the cross-validation part. Such an argument is necessary there, since RF does include some randomness in model building (a lot of it, in fact, as already implied by the very name of the algorithm); k-NN, in contrast, is a deterministic algorithm, so in principle it has no need for a random_state.

    That said, your question is indeed valid; I have commented in the past on this annoying and inconvenient absence of a shuffling argument in cross_val_score. Digging into the documentation, we see that under the hood, the function uses either StratifiedKFold or KFold to build the folds:

    cv : int, cross-validation generator or an iterable, optional

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

    and both of these classes, as you can easily see from their documentation pages, use shuffle=False as the default value.

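    Anyway, the solution is simple, consisting of a single additional line of code (plus an import); you just need to replace cv=5 with a previously defined StratifiedKFold object that has shuffle=True. Since its random_state is left at the default of None, the folds are reshuffled on every run: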

    from sklearn.model_selection import StratifiedKFold

    # shuffle=True with no random_state gives different folds on each run
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    cv_score_av = np.mean(cross_val_score(clf, x, y, cv=skf))
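
To see the behaviour described above end to end, here is a minimal, self-contained sketch. It is only an illustration under stated assumptions: the data come from make_classification as a synthetic stand-in for your x and y, and mean_cv_score is a hypothetical helper introduced here for brevity. It compares the default cv=5, a shuffled splitter with a fixed seed, and a shuffled splitter with no seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data standing in for the asker's x and y.
x, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = KNeighborsClassifier(n_neighbors=28)


def mean_cv_score(cv):
    """Hypothetical helper: mean accuracy of clf over the given CV scheme."""
    return np.mean(cross_val_score(clf, x, y, cv=cv))


# cv=5 means StratifiedKFold with shuffle=False for a classifier:
# the folds never change, and k-NN is deterministic, so the score repeats.
default_a = mean_cv_score(5)
default_b = mean_cv_score(5)

# Shuffled folds with a fixed seed: shuffled once, but reproducible.
seeded_a = mean_cv_score(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
seeded_b = mean_cv_score(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

# Shuffled folds without a seed: the folds are redrawn on every call,
# so the mean score is free to vary from one call to the next.
unseeded = [
    mean_cv_score(StratifiedKFold(n_splits=5, shuffle=True)) for _ in range(5)
]

print("default cv=5 repeats:", default_a == default_b)
print("seeded shuffle repeats:", seeded_a == seeded_b)
print("unseeded shuffle scores:", unseeded)
```

If you later want the shuffled result to be repeatable (for a report, say), passing an integer random_state to StratifiedKFold, as in the seeded case, should do it while keeping the shuffling.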