Tags: python, scikit-learn, cross-validation

Should I first train_test_split and then use cross validation?


If I plan to use cross-validation (KFold), should I still split the dataset into training and test data and perform my training (including cross-validation) only on the training set? Or will CV do everything for me? E.g.

Option 1

X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = GridSearchCV(..., cv=5)
clf.fit(X_train, y_train)

Option 2

clf = GridSearchCV(..., cv=5)
clf.fit(X, y)

Solution

  • Cross-validation is good, but it's better to keep a train/test split so you get an independent score estimate on untouched data (Option 1). If your CV score and your test-set score come out roughly the same, you can then drop the train/test split and cross-validate on the whole dataset to get a slightly better model. But don't do that until you are sure the two scores are consistent; see the sketch below.
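
A minimal sketch of that workflow, using the iris dataset and a LogisticRegression as stand-ins for the unspecified data, estimator, and parameter grid from the question:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

# Placeholder data and estimator; substitute your own X, y and model.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Tune hyperparameters with 5-fold CV on the training set only.
param_grid = {"C": [0.1, 1, 10]}
clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
clf.fit(X_train, y_train)

# Compare the CV estimate with the score on the untouched test set.
print("Best CV score: ", clf.best_score_)
print("Test-set score:", clf.score(X_test, y_test))

If best_score_ (the mean CV score over the training folds) and the test-set score agree, refitting on all of X, y is reasonable; note that GridSearchCV's default refit=True already refits the best estimator on whatever data you pass to fit.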