Tags: python, scikit-learn, cross-validation

Discrepancy between KFold on the one hand and KFold with shuffle=True and RepeatedKFold on the other hand in sklearn


I am comparing KFold and RepeatedKFold using sklearn version 0.22. According to the documentation, RepeatedKFold "Repeats K-Fold n times with different randomization in each repetition." One would expect the results of running RepeatedKFold with only one repeat (n_repeats=1) to be pretty much identical to KFold's.

I ran a simple comparison:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics

X, y = load_digits(return_X_y=True)

classifier = SGDClassifier(loss='hinge', penalty='elasticnet',  fit_intercept=True)
scorer = metrics.accuracy_score
results = []
n_splits = 5
kf = KFold(n_splits=n_splits)
for train_index, test_index in kf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('KFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))
print()

results = []
n_repeats = 1
rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats)
for train_index, test_index in rkf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('RepeatedKFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))

The output is

KFold
mean =  0.9082079851439182
std =  0.04697225962068869

RepeatedKFold
mean =  0.9493562364593006
std =  0.017732595698953055

I repeated this experiment enough times to see that the difference is statistically significant.
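
A sketch of how such a repetition could be set up (the run_cv helper and the use of scipy's ttest_ind here are only illustrative, not part of the original code):

from scipy.stats import ttest_ind

def run_cv(cv, X, y, classifier, scorer):
    # One accuracy score per fold for a given splitter.
    scores = []
    for train_index, test_index in cv.split(X, y):
        classifier.fit(X[train_index], y[train_index])
        scores.append(scorer(y[test_index], classifier.predict(X[test_index])))
    return scores

kf_means, rkf_means = [], []
for _ in range(20):
    kf_means.append(np.mean(run_cv(KFold(n_splits=5), X, y, classifier, scorer)))
    rkf_means.append(np.mean(run_cv(RepeatedKFold(n_splits=5, n_repeats=1), X, y, classifier, scorer)))

# Two-sample t-test on the per-run mean accuracies.
print(ttest_ind(kf_means, rkf_means))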

I have read and reread the documentation to see if I'm missing something, but to no avail.

Btw, the same holds true for StratifiedKFold and RepeatedStratifiedKFold:

StratifiedKFold
mean =  0.9159935004642525
std =  0.026687786392525545

RepeatedStratifiedKFold
mean =  0.9560476632621479
std =  0.014405630805910506

For this data set, StratifiedKFold agrees with KFold; RepeatedStratifiedKFold agrees with RepeatedKFold.
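
(As an aside of mine, not from the original question: one plausible reason stratification changes little here is that the digits classes are roughly balanced, which is easy to check:)

import numpy as np
print(np.bincount(y))  # roughly 180 samples per digit class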

UPDATE Following the suggestions from @Dan and @SergeyBushmanov, I included shuffle and random_state:

def run_nfold(X, y, classifier, scorer, cv, n_repeats):
    # Run the splitter n_repeats times and collect one accuracy score per fold.
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))    
    return results
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))

kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))

rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats, random_state = 111)
results_kf_repeated = run_nfold(X,y, classifier, scorer, rkf, 10)
print('RepeatedKFold mean = ', np.mean(results_kf_repeated))

produces

KFold mean =  0.9119255648406066
KFold Shuffled mean =  0.9505304859176724
RepeatedKFold mean =  0.950754100897555

Moreover, using the Kolmogorov-Smirnov test:

from scipy.stats import ks_2samp

print('Compare KFold with KFold shuffled results')
print(ks_2samp(results_kf, results_kf_shuffle))
print('Compare RepeatedKFold with KFold shuffled results')
print(ks_2samp(results_kf_repeated, results_kf_shuffle))

shows that shuffled KFold and RepeatedKFold (which, as @Dan pointed out, is shuffled by default) are statistically the same, whereas the default non-shuffled KFold produces a statistically significantly lower result:

Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)

Compare RepeatedKFold with KFold shuffled results
Ks_2sampResult(statistic=0.14, pvalue=0.7166468440414822)

Now, note that I used different random_state values for KFold and RepeatedKFold. So the answer, or rather the partial answer, is that the difference in results is due to shuffling vs. not shuffling. This makes sense: using a different random_state can change the exact splits, but it shouldn't change statistical properties such as the mean over multiple runs.
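
The mechanical difference is easy to see on a toy index array: without shuffle=True, KFold takes consecutive blocks of samples, so any systematic ordering in the data carries straight into the folds (a small illustration of mine):

import numpy as np
from sklearn.model_selection import KFold

idx = np.arange(10)

# Without shuffling, each test fold is a contiguous block of indices.
print([test for _, test in KFold(n_splits=5).split(idx)])
# -> [array([0, 1]), array([2, 3]), array([4, 5]), array([6, 7]), array([8, 9])]

# With shuffle=True, the indices are permuted before being split into folds.
print([test for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(idx)])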

I'm now confused about why shuffling has this effect. I've changed the title of the question to reflect this confusion (I hope that doesn't break any Stack Overflow rules, but I don't want to create another question).

UPDATE I agree with @SergeyBushmanov's suggestion. I posted it as a new question.


Solution

  • To make RepeatedKFold behave like KFold, you have to shuffle KFold and give both the same random_state:

    import numpy as np
    from sklearn.model_selection import KFold, RepeatedKFold

    np.random.seed(42)
    n = np.random.choice([0, 1], 10, p=[.5, .5])
    kf = KFold(2, shuffle=True, random_state=42)
    list(kf.split(n))
    [(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
     (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]
    
    kfr = RepeatedKFold(n_splits=2, n_repeats=1, random_state=42)
    list(kfr.split(n))
    [(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
     (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]
    

    RepeatedKFold uses KFold under the hood to generate the folds; you only need to make sure both use the same random_state (and that KFold has shuffle=True).
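
    As a quick sanity check on the data from the question (an illustrative sketch, assuming the X loaded there), the two splitters can be compared fold by fold:

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rkf = RepeatedKFold(n_splits=5, n_repeats=1, random_state=42)

    # With the same random_state, the generated folds coincide exactly.
    for (tr1, te1), (tr2, te2) in zip(kf.split(X), rkf.split(X)):
        assert np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
    print('identical splits')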