python machine-learning cross-validation train-test-split

How is train_test_split with test_size=0 affecting the data?

I was using train_test_split in my code and then wanted to change it to cross validation, but something strange is hapenning.

train, test = train_test_split(data, test_size=0)

x_train = train.drop('CRO', axis=1)
y_train = train['CRO']

scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

for k in range(1, 5):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = model_selection.cross_val_score(knn, x_train, y_train, cv=5)
    print(scores.mean(), 'score for k = ', k)

This code gives the scores around 0.8, but when I delete that first line and change the 'train' set for the 'data' set in the 2nd and 3rd lines, the score changes to 0.2, which is strange because I even set the test_size to 0 so the train should be equal to the whole data. What is hapenning?

Solution

One thing to be aware of are the implicit arguments passed in train_test_split.

By default, shuffle=True, which could easily be adding some noise into your training data by shuffling it, where just passing the data in without shuffling my be introducing some other pattern into the model.