My aim here is to build one pipeline per model that handles preprocessing, run nested cross-validation to prevent information leakage, and then compare the performances and pick the best model.
Questions:
#Imports
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#Organising data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe_rf = Pipeline([('scl', StandardScaler()),
('clf', RandomForestClassifier(random_state=42))])
param_range = list(range(1, 11))
grid_params_rf = [{'clf__criterion': ['gini', 'entropy'],
'clf__min_samples_leaf': param_range,
'clf__max_depth': param_range,
'clf__min_samples_split': param_range[1:]}]
gs_rf = GridSearchCV(estimator=pipe_rf,
                     param_grid=grid_params_rf,
                     scoring=['accuracy', 'f1', 'recall'],
                     refit='accuracy',
                     cv=10,
                     n_jobs=-1,
                     return_train_score=True)  #needed for the mean_train_* keys in cv_results_
gs_rf.fit(X_train, y_train)
#Get the cross-validated scores for the best parameters:
gs_rf.best_params_
gs_rf.best_score_ #mean cross-validated accuracy (a validation score, not a training score)
#train f1 and train recall live in cv_results_ (see below); they are only recorded
#when GridSearchCV is created with return_train_score=True
#Find out how well it generalises by predicting on X_test and comparing predictions to y_test
y_predict = gs_rf.predict(X_test)
accuracy_score(y_test, y_predict) #test accuracy
recall_score(y_test, y_predict) #test recall
f1_score(y_test, y_predict) #test f1
#Evaluating the model (using this value to compare all of my different models, e.g. RF, SVM, DT)
scor = cross_validate(gs_rf, X_test, y_test, scoring=['accuracy', 'f1', 'recall'], cv=5, n_jobs=-1)
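Since the stated goal is nested cross-validation, here is a minimal sketch of the fully nested version: GridSearchCV as the inner tuning loop, wrapped in cross_validate as the outer evaluation loop, so no outer fold ever influences hyperparameter selection. It runs on hypothetical synthetic data standing in for df, with a deliberately small grid and forest to keep it fast:

```python
# Nested CV sketch: inner loop tunes, outer loop estimates generalisation.
# make_classification stands in for the real df; grid and forest sizes are
# kept small purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', RandomForestClassifier(n_estimators=25, random_state=42))])

inner = GridSearchCV(pipe,
                     {'clf__max_depth': [2, 4, 6]},
                     scoring='accuracy',
                     cv=3)                      # inner loop: hyperparameter tuning

outer = cross_validate(inner, X, y,
                       scoring=['accuracy', 'f1', 'recall'],
                       cv=5)                    # outer loop: unbiased performance estimate

print(outer['test_accuracy'].mean())
```

The mean of outer['test_accuracy'] (and of test_f1, test_recall) is the number to compare across models such as RF, SVM and DT, since it scores the whole tune-and-fit procedure rather than one tuned model.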
The GridSearchCV object retains the cross-validated scores for each metric in its cv_results_ attribute. Each entry there is an array with one value per parameter combination, so index it with best_index_ to get the scores of the best estimator; note that the mean_train_* keys only exist when the search was created with return_train_score=True. Here's how you can access the training F1 and recall scores:
f1_train_scores = gs_rf.cv_results_['mean_train_f1']
recall_train_scores = gs_rf.cv_results_['mean_train_recall']
f1_train_best = f1_train_scores[gs_rf.best_index_]
recall_train_best = recall_train_scores[gs_rf.best_index_]
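To make that extraction concrete, here is a self-contained sketch on hypothetical synthetic data (the grid and forest are kept tiny for speed); the key details are return_train_score=True at construction and best_index_ to pick out the winning candidate's row:

```python
# Pulling the best candidate's cross-validated scores out of cv_results_.
# make_classification stands in for the real df.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=150, random_state=0)

gs = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=42),
                  {'max_depth': [2, 4]},
                  scoring=['accuracy', 'f1', 'recall'],
                  refit='accuracy',
                  cv=3,
                  return_train_score=True)   # without this, mean_train_* keys are absent
gs.fit(X, y)

i = gs.best_index_  # row of the parameter combination that won on the refit metric
val_f1 = gs.cv_results_['mean_test_f1'][i]           # cross-validated (validation) f1
val_recall = gs.cv_results_['mean_test_recall'][i]   # cross-validated (validation) recall
train_f1 = gs.cv_results_['mean_train_f1'][i]        # training f1
train_recall = gs.cv_results_['mean_train_recall'][i]
```

The mean_test_* entries are the validation scores averaged over folds; comparing them with the mean_train_* entries is a quick check for overfitting of the best candidate.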