If I plan to use cross-validation (KFold), should I still split the dataset into training and test data and perform my training (including cross-validation) only on the training set? Or will CV do everything for me? E.g.
Option 1
X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = GridSearchCV(..., cv=5)
clf.fit(X_train, y_train)
Option 2
clf = GridSearchCV(..., cv=5)
clf.fit(X, y)
CV is good, but it's better to also keep a train/test split (Option 1) so that you get an independent score estimate on untouched data. If your CV score and your test score come out about the same, you can drop the train/test split and run CV on the whole dataset (Option 2) to get a slightly better model. But don't do that until you are sure the held-out test score and the CV score are consistent.
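A minimal sketch of that comparison, following Option 1. The iris dataset, the `SVC` estimator, the `C` grid, and the fixed `random_state` are assumptions chosen only to make the example runnable; substitute your own data and model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Illustrative data only (assumption): any X, y would do here.
X, y = load_iris(return_X_y=True)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Cross-validated hyperparameter search on the training portion only.
# The estimator and parameter grid are placeholders (assumption).
clf = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
clf.fit(X_train, y_train)

# Mean CV score of the best parameter setting vs. score on untouched data.
cv_score = clf.best_score_
test_score = clf.score(X_test, y_test)
print(f"best params: {clf.best_params_}")
print(f"CV score:    {cv_score:.3f}")
print(f"test score:  {test_score:.3f}")
```

If the two printed scores stay close across a few different splits, that is some evidence the CV estimate can be trusted on its own; how close is "close enough" depends on your problem and is a judgment call.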