Tags: python, scikit-learn, cross-validation, sklearn-pandas, train-test-split

Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?


Consider the following two options presented:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# sklearn.__version__ 0.17.1
# python --version 3.5.2, Anaconda 4.1.1 (64-bit)

# Note: on sklearn 0.17.1 the commented-out model_selection variant below
# raises: TypeError: __init__() got an unexpected keyword argument 'n_splits'
# so the (now deprecated) cross_validation module is used instead.

import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

# define feature matrix and target variable
boston = load_boston()
X, y = boston.data, boston.target

# Create Algorithm Object (Gradient Boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

#====================================================
# Option B
#====================================================
#shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: Average performance: ', cross_val.mean())
print('===============================================')
# --> different performance in every iteration because of different training
# and test sets.
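To see concretely what ShuffleSplit is doing here, a minimal sketch (assuming scikit-learn 0.18+, where ShuffleSplit lives in `sklearn.model_selection` and `n_splits` replaces `n_iter`; on 0.17.1 the equivalent is `sklearn.cross_validation.ShuffleSplit(n=10, n_iter=3, ...)`):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# a tiny feature matrix: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)

# with a fixed random_state the splits are reproducible between runs,
# but each of the n_splits iterations draws a *different* random
# train/test partition -- this is where the score variation comes from
ss = ShuffleSplit(n_splits=3, train_size=0.75, random_state=0)
for train_idx, test_idx in ss.split(X):
    print('train:', sorted(train_idx), 'test:', sorted(test_idx))
```

Each iteration yields a disjoint train/test index pair (7 train, 3 test here), and the pairs differ across iterations, so `cross_val_score` evaluates the model on a different held-out set every time.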


#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)

for i in iterations:
    # randomly split the data into train and test
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # train gbr on each new training set (10 iterations)
    gbr.fit(Xtrain, ytrain)
    score = gbr.score(Xtrain, ytrain)
    individual_results.append(score)

avg_score = sum(individual_results)/len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: Average Performance: ', avg_score)
print('===============================================')

Here is a copy of the output:

Individual performance:  [ 0.77535372  0.81760604  0.87146377  0.94041114  0.92648961  0.87761488
  0.82843891  0.81833855  0.90167889  0.90014986]
===============================================
Option B: Average performance:  0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: Average Performance:  0.980088233434
===============================================

Can anyone explain why the ShuffleSplit function in Option B produces more varied (random) results than the train_test_split function (with random_state=None) in Option C?


Solution

  • The score is calculated on Xtrain instead of Xtest in Option C, so Option C reports training performance rather than generalization. Gradient boosting fits its own training data almost perfectly, which is why those scores cluster tightly around 0.98. cross_val_score in Option B always evaluates on the held-out fold, so its scores reflect the split-to-split variability.

    With

    score = gbr.score(Xtest, ytest)
    

    the scores are now

    [0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]
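For reference, a minimal sketch of both options with the bug fixed, using the modern API (assumes scikit-learn 0.18+; `make_regression` stands in for the Boston housing data, which has been removed from recent releases, so the exact scores will differ from the question's output):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

# synthetic regression data with roughly the Boston dataset's shape
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

# Option B, modern API: n_splits replaces n_iter, and there is no n= argument
shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
cv_scores = cross_val_score(gbr, X, y, cv=shuffle)

# Option C, corrected: score on the held-out test set, not the training set
manual_scores = []
for _ in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=None)
    gbr.fit(Xtr, ytr)
    manual_scores.append(gbr.score(Xte, yte))  # Xtest, not Xtrain

print('Option B mean:', cv_scores.mean())
print('Option C mean:', np.mean(manual_scores))
```

With the test-set scoring in place, both options show comparable averages and comparable spread, since both are now measuring performance on unseen data.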