I am working on a classification problem where I need to predict the class of textual data. I need to do hyperparameter tuning for my classification model, for which I am thinking of using GridSearchCV. I also need to use StratifiedKFold because my data is imbalanced. I am aware that GridSearchCV internally uses StratifiedKFold when cv is an integer and the estimator is a classifier with binary or multiclass targets. I have read here that in the case of TfidfVectorizer we apply fit_transform to the train data and only transform to the test data. This is what I have done below using StratifiedKFold.
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

# X, y are pandas Series holding the documents and their labels
tfidf_vectorizer = TfidfVectorizer()

# shuffle=True is required when random_state is set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

for iteration, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Iteration number {iteration}")
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    # fit the vectorizer on the training fold only; transform the test fold
    train_tfid = tfidf_vectorizer.fit_transform(X_train.values.astype('U'))
    test_tfid = tfidf_vectorizer.transform(X_test.values.astype('U'))

    svc_model = linear_model.SGDClassifier()
    svc_model.fit(train_tfid, y_train.values.ravel())
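For reference, this is roughly how I score each fold (a sketch; f1_score with average='macro' is one common choice for imbalanced classes):

from sklearn.metrics import accuracy_score, f1_score

# inside the loop above, after fitting on the training fold
preds = svc_model.predict(test_tfid)
print("accuracy:", accuracy_score(y_test, preds))
print("macro F1:", f1_score(y_test, preds, average='macro'))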
The accuracy/F1 I am getting is not good, so I thought of doing hyperparameter tuning using GridSearchCV. In GridSearchCV we do:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating the logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)
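After fitting, the chosen hyperparameters and cross-validated score can be read off the fitted object (these are standard GridSearchCV attributes):

print(logreg_cv.best_params_)  # e.g. {'C': ...}
print(logreg_cv.best_score_)   # mean cross-validated score of the best estimator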
As I understand it, logreg_cv.fit(X, y) would internally split X into X_train and X_test k times and then make predictions to give us the best estimator. In my case, what should X be? If it is the X generated after fit_transform, then when X is internally split into train and test, the test data has already undergone fit_transform, whereas ideally it should undergo only transform. My concern is: inside GridSearchCV, how can I control that fit_transform is applied only to the train data and transform is applied to the test (validation) data? If it internally applies fit_transform to the entire data, that is not good practice.
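To make the concern concrete, here is a toy sketch (the documents are made up) showing that the fitted idf statistics change once the validation fold is included in the fit, which is exactly the leak I want to avoid:

from sklearn.feature_extraction.text import TfidfVectorizer

docs_train = ["spam spam ham", "ham eggs"]  # toy training fold
docs_valid = ["spam eggs eggs"]             # toy validation fold

# fit on the training fold only: idf values come from the train docs alone
vec = TfidfVectorizer().fit(docs_train)
print(dict(zip(vec.get_feature_names_out(), vec.idf_)))

# fit on train + validation: the idf values change, i.e. statistics from
# the validation fold have leaked into the training-time features
vec_leaky = TfidfVectorizer().fit(docs_train + docs_valid)
print(dict(zip(vec_leaky.get_feature_names_out(), vec_leaky.idf_)))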
This is exactly the scenario where you should be using a Pipeline inside GridSearchCV. First, create a pipeline with the required steps, such as data preprocessing, feature selection, and the model. Once you call GridSearchCV on this pipeline, it will fit the preprocessing steps on the training folds only and then fit the model, so the validation fold in each split is only transformed. Read here to understand more about the model selection module in sklearn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=cats)
X, y = newsgroups_train.data, newsgroups_train.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y)

my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])

# parameters of pipeline steps are addressed as '<step name>__<parameter>'
parameters = {'clf__C': np.logspace(-5, 8, 15)}

grid_search = GridSearchCV(my_pipeline, param_grid=parameters,
                           cv=10, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
# {'clf__C': 0.4393970560760795}

grid_search.score(X_test, y_test)
# 0.8981481481481481
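Applied to your setup, the same idea looks roughly like this (a sketch, not tuned for your data; the step names 'tfidf' and 'clf' and the grid values are illustrative): put TfidfVectorizer and SGDClassifier in the pipeline, and pass an explicit StratifiedKFold object as cv if you want to control shuffling. The vectorizer is then re-fitted on the training folds only in every split.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier())
])

# hyperparameters of each step use the '<step name>__<parameter>' syntax
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-5, 1e-4, 1e-3],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv,
                    scoring='f1_macro', n_jobs=-1)
# grid.fit(X, y)  # X: iterable of raw text documents, y: labels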