I am new to machine learning. My question is the following: I have built a model and am trying to optimize it. From some research I found that cross-validation can be used to help me avoid overfitting, and that GridSearchCV can be used to tune the model's hyperparameters and eventually identify the best combination.
Now my question is: should I do cross-validation first and then use grid search to identify the best parameters, or is GridSearchCV alone enough, given that it performs cross-validation itself?
As suggested by @Noki, you can use the cv parameter in GridSearchCV.
GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated',
             refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs',
             error_score=nan, return_train_score=False)
The documentation also clearly states that if it's a classification problem, the cross-validation is automatically stratified:
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
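To make this concrete, here is a minimal sketch showing that GridSearchCV alone covers both steps: each candidate parameter setting is scored with (stratified) cross-validation internally, so no separate CV pass is needed beforehand. The estimator, grid values, and synthetic data below are illustrative assumptions, not part of the original question.

```python
# Sketch: GridSearchCV performs the cross-validation itself via cv=.
# LogisticRegression, the C grid, and make_classification data are
# assumptions chosen only for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)  # every C value is evaluated with 5-fold stratified CV

print(search.best_params_)           # the winning parameter combination
print(search.cv_results_["mean_test_score"])  # one mean CV score per candidate
```

After fitting, `best_params_` holds the winning combination and, because `refit=True` by default, `search` itself is refit on the full data with those parameters.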
However, there is something I would like to add: you can make the number of K-folds dynamic with respect to the value counts of your Y_target variable. The least frequent class in Y_target cannot have a count of 1, or stratified K-fold will throw an error during training; I have run into this myself. Use the snippet below to guard against that.
For example:

import pandas as pd

Y_target = pd.Series([0, 1, 1, 1, 1, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6])

# value_counts() sorts descending, so .iloc[-1] is the rarest class's count
if Y_target.value_counts().iloc[-1] < 2:
    raise Exception("No class can have a frequency count of 1 in Y_target")
else:
    Kfold = Y_target.value_counts().iloc[-1]
You can then assign Kfold to the cv parameter of GridSearchCV.
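Putting the two pieces together, a hedged end-to-end sketch might look like this. The classifier, the imbalanced synthetic data, and the cap of 5 folds are my own assumptions for illustration; the key point is that the fold count derived from the rarest class is what gets passed to cv.

```python
# Sketch: derive a valid fold count from the rarest class, then feed it
# to GridSearchCV's cv parameter. Data and estimator are assumptions.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: roughly 80% class 0, 20% class 1.
X, y = make_classification(n_samples=60, n_features=4,
                           weights=[0.8], random_state=0)
y_target = pd.Series(y)

min_count = y_target.value_counts().iloc[-1]  # count of the rarest class
if min_count < 2:
    raise Exception("No class can have a frequency count of 1 in y_target")

# Stratified K-fold needs every class in every fold, so the fold count
# cannot exceed min_count; capping at 5 is an arbitrary choice here.
kfold = int(min(min_count, 5))

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 3]}, cv=kfold)
search.fit(X, y_target)
print(search.best_params_)
```

If the rarest class had only 3 members, `kfold` would shrink to 3 automatically, keeping the stratified split valid.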