machine-learning scikit-learn cross-validation tf-idf grid-search

GridSearchCV + StratifiedKFold in case of TF-IDF

I am working on a text classification problem and want to tune my model's hyperparameters with GridSearchCV. My data is imbalanced, so I also need StratifiedKFold. I am aware that GridSearchCV uses StratifiedKFold internally when cv is an integer or None, the estimator is a classifier, and the target is binary or multiclass.

I have read here that, with TfidfVectorizer, we apply fit_transform to the training data and only transform to the test data.

This is what I have done below using StratifiedKFold.

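from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

# random_state only takes effect when shuffle=True
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
# Assumed default settings; the original definition was not shown
tfidf_vectorizer = TfidfVectorizer()
iteration = 0

for train_index, test_index in skf.split(X, y):
    iteration += 1
    print(f"Iteration number {iteration}")
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    # Learn the vocabulary and IDF weights from the training fold only
    train_tfid = tfidf_vectorizer.fit_transform(X_train.values.astype('U'))
    test_tfid = tfidf_vectorizer.transform(X_test.values.astype('U'))

    svc_model = linear_model.SGDClassifier()
    svc_model.fit(train_tfid, y_train.values.ravel())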

The accuracy/F1 I am getting is not good, so I thought of tuning hyperparameters with GridSearchCV. With GridSearchCV we do:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

logreg_cv.fit(X, y)

As I understand it, logreg_cv.fit(X, y) internally splits X into training and validation parts k times, fits and scores on each split, and then gives us the best estimator.

In my case, what should X be? If it is the matrix produced by fit_transform, then when X is split internally into train and test, the test data has already undergone fit_transform, whereas ideally it should undergo only transform.

My concern is how, inside GridSearchCV, I can ensure that fit_transform is applied only to the training data and transform to the test (validation) data. If it internally applies fit_transform to the entire dataset, that is not good practice.

Solution

This is exactly the scenario where you should pass a Pipeline to GridSearchCV. First, create a pipeline with the required steps, such as data preprocessing, feature selection, and the model. When you call fit on a GridSearchCV built around this pipeline, each split fits the preprocessing steps on the training folds only and then fits the model; the held-out fold is only transformed before it is scored.
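To make the per-fold behaviour concrete, here is a minimal sketch of roughly what happens on each split when a pipeline is cross-validated. The toy corpus, the variable names, and the choice of TfidfVectorizer with LogisticRegression are assumptions for illustration and do not come from the question's data. It is not GridSearchCV's actual implementation, only a hand-written loop with the same effect.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical data, for illustration only)
docs = [
    "rocket launch orbit", "orbit satellite moon",
    "moon rocket engine", "satellite launch engine",
    "faith belief church", "church prayer belief",
    "prayer faith scripture", "scripture church faith",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
fold_vocab_matches = []

for train_idx, val_idx in skf.split(docs, labels):
    train_docs = [docs[i] for i in train_idx]
    val_docs = [docs[i] for i in val_idx]
    y_train = [labels[i] for i in train_idx]
    y_val = [labels[i] for i in val_idx]

    # A fresh, unfitted copy per fold: fit() runs fit_transform on the
    # vectorizer using the training documents only.
    fold_pipeline = clone(pipeline)
    fold_pipeline.fit(train_docs, y_train)

    # score() only calls transform() on the validation documents.
    fold_score = fold_pipeline.score(val_docs, y_val)

    # The learned vocabulary contains training-fold words and nothing else.
    vocab = set(fold_pipeline.named_steps["vectorizer"].vocabulary_)
    train_words = {word for doc in train_docs for word in doc.split()}
    fold_vocab_matches.append(vocab == train_words)
    print(f"validation accuracy: {fold_score:.2f}, vocabulary size: {len(vocab)}")
```

Because the vectorizer lives inside the pipeline, words that occur only in the validation fold never enter the vocabulary or the IDF weights, which is the separation the question asks for.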

Read here to understand more about the model selection module in sklearn.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=cats)
X, y = newsgroups_train.data, newsgroups_train.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y)


my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])


parameters = {'clf__C': np.logspace(-5, 8, 15)}

grid_search = GridSearchCV(my_pipeline, param_grid=parameters,
                           cv=10, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
# {'clf__C': 0.4393970560760795}

print(grid_search.score(X_test, y_test))
# 0.8981481481481481 (varies between runs because the split is not seeded)
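
The example above uses CountVectorizer and the default integer cv. The same pattern should carry over to the setup in the question. The following is a sketch under assumptions rather than a tuned solution: the toy imbalanced corpus, the step names, the parameter ranges, and the f1_macro scoring choice are all placeholders to be replaced with your own. It swaps in TfidfVectorizer, passes an explicit StratifiedKFold, and tunes a vectorizer parameter alongside C.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, check_cv
from sklearn.pipeline import Pipeline

# Toy imbalanced corpus (hypothetical; substitute your own texts and labels)
majority = ["rocket orbit launch", "satellite orbit moon", "moon launch engine",
            "engine rocket fuel", "fuel orbit satellite", "launch moon rocket"] * 2
minority = ["faith church prayer", "prayer belief church",
            "belief faith scripture"] * 2
texts = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": np.logspace(-2, 2, 5),
}

# Explicit stratified splitter; shuffle=True so random_state takes effect
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=5)

grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           cv=cv, scoring="f1_macro")
grid_search.fit(texts, labels)  # raw text goes in, not a TF-IDF matrix

print(grid_search.best_params_)
print(grid_search.best_score_)

# An integer cv with a classifier and binary/multiclass labels
# resolves to a stratified splitter as well.
default_cv = check_cv(5, labels, classifier=True)
print(type(default_cv).__name__)
```

Note that X passed to fit is the raw text, so there is no pre-computed TF-IDF matrix that could have seen the validation folds. A macro-averaged F1 is used here only because accuracy can be misleading on imbalanced classes; any other scorer can be substituted.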