I'm having some difficulty with cross_val_score() in sklearn.
I have instantiated a KNeighborsClassifier with the following code:
clf = KNeighborsClassifier(n_neighbors=28)
I am then using cross-validation to estimate the accuracy of this classifier on my df of features (x) and target series (y) with the following:
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=5))
I was hoping to get a different result each time I run the script; however, there is no option to set random_state=None as there is with RandomForestClassifier(), for example. Is there a way to get a different result on each run, or will I have to shuffle my data manually before running cross_val_score on my KNeighborsClassifier model?
There seems to be some misunderstanding on your part here; the random_state argument in the Random Forest refers to the algorithm itself, not to the cross-validation part. Such an argument is necessary there, since RF does include some randomness in model building (a lot of it, in fact, as the very name of the algorithm implies); kNN, in contrast, is a deterministic algorithm, so in principle it has no need for a random_state.
That said, your question is indeed valid; I have commented in the past on this annoying and inconvenient absence of a shuffling argument in cross_val_score. Digging into the documentation, we see that under the hood, the function uses either StratifiedKFold or KFold to build the folds:
cv : int, cross-validation generator or an iterable, optional
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
and both of these classes, as you can easily see from the linked documentation pages, use shuffle=False as the default value.
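To see the effect of that default, here is a minimal sketch; it assumes the iris dataset as a stand-in for your own x and y, which is not something from your post. With an integer cv, repeated calls build exactly the same folds, so a deterministic model such as kNN should return the same scores every time:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the asker's features (x) and target (y).
x, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=28)

# The splitter built for an integer cv does not shuffle by default.
default_shuffle = StratifiedKFold(n_splits=5).shuffle
print("shuffle default:", default_shuffle)

# Two independent runs with cv=5 use identical folds.
first_run = cross_val_score(clf, x, y, cv=5)
second_run = cross_val_score(clf, x, y, cv=5)
print("identical scores:", np.array_equal(first_run, second_run))
```

This is why rerunning your script keeps producing the same average.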
Anyway, the solution is simple, consisting of one additional line of code (plus an import); you just need to replace cv=5 with a previously defined StratifiedKFold object that has shuffle=True:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True)
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=skf))
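For completeness, a self-contained sketch of the same idea, again assuming the iris dataset in place of your data; the helper mean_cv_accuracy is hypothetical and exists only for this illustration. Leaving random_state unset lets the folds (and typically the average score) change between calls, while passing a fixed integer makes the shuffled folds reproducible:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the asker's features (x) and target (y).
x, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=28)


def mean_cv_accuracy(seed=None):
    """Hypothetical helper: mean accuracy over 5 shuffled, stratified folds."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return np.mean(cross_val_score(clf, x, y, cv=skf))


# No seed: the folds are reshuffled on every call, so the values may differ.
unseeded = [mean_cv_accuracy() for _ in range(3)]

# Fixed seed: the same shuffled folds every time, so the values repeat.
seeded = [mean_cv_accuracy(seed=0) for _ in range(3)]

print("unseeded:", unseeded)
print("seeded:  ", seeded)
```

If you later need to reproduce a particular run, passing an integer random_state to StratifiedKFold is the usual way to do it.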