Tags: python, machine-learning, scikit-learn, cross-validation, text-classification

Getting different score values between manual cross validation and cross_val_score


I created a Python for loop to split the training dataset into stratified K folds and trained a classifier inside the loop. I then used the trained model to predict on the validation fold. The metrics obtained with this process were quite different from those obtained with the cross_val_score function, although I expected both methods to give the same results.

This code is for text classification, and I use TF-IDF to vectorize the text.
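X_train and y_train are 1-D arrays of raw text documents and binary class labels; purely as an illustration of the assumed input format (not my actual data), something like:

# Hypothetical placeholder data, only to show the assumed shape of the inputs
import numpy as np
X_train = np.array(["good product, works great", "arrived broken, very poor",
                    "excellent value for the price", "complete waste of money"])
y_train = np.array([1, 0, 1, 0])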

Code for manual implementation of cross validation:

#Importing metrics functions to measure the performance of a model
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
data_validation = []  # list used to store the results of model validation using cross validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_val = []
f1_val = []

# use ravel function to flatten the multi-dimensional array to a single dimension
for train_index, val_index in skf.split(X_train, y_train):
    X_tr, X_val = X_train.ravel()[train_index], X_train.ravel()[val_index] 
    y_tr, y_val  = y_train.ravel()[train_index] , y_train.ravel()[val_index]
    tfidf=TfidfVectorizer()
    X_tr_vec_tfidf = tfidf.fit_transform(X_tr) # vectorize the training folds
    X_val_vec_tfidf = tfidf.transform(X_val) # vectorize the validation fold    
    #instantiate the model
    model = MultinomialNB(alpha=0.5, fit_prior=False)
    #Training the empty model with our training dataset
    model.fit(X_tr_vec_tfidf, y_tr)  
    predictions_val = model.predict(X_val_vec_tfidf) # make predictions with the validation dataset
    acc_val = accuracy_score(y_val, predictions_val)
    accuracy_val.append(acc_val)
    f_val=f1_score(y_val, predictions_val)
    f1_val.append(f_val)

avg_accuracy_val = np.mean(accuracy_val)
avg_f1_val = np.mean(f1_val)

# temp list to store the metrics 
temp = ['NaiveBayes']
temp.append(avg_accuracy_val)   #validation accuracy score 
temp.append(avg_f1_val)         #validation f1 score
data_validation.append(temp)    
#Create a table, using a DataFrame, which contains the metrics for all the trained and tested ML models
result = pd.DataFrame(data_validation, columns = ['Algorithm', 'Accuracy Score : Validation', 'F1-Score : Validation'])
result.reset_index(drop=True, inplace=True)
result      

Output:

    Algorithm   Accuracy Score : Validation   F1-Score : Validation
0   NaiveBayes  0.77012                       0.733994

Now the code using the cross_val_score function:

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
scores = ['accuracy', 'f1']
#Text vectorization of the whole training dataset using TF-IDF
tfidf=TfidfVectorizer()
X_tr_vec_tfidf = tfidf.fit_transform(X_train)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nb=MultinomialNB(alpha=0.5, fit_prior=False) 
for score in scores:
    print (f'{score}: {cross_val_score(nb,X_tr_vec_tfidf,y_train,cv=skf,scoring=score).mean()} ')

Output:

accuracy: 0.7341283583255231 
f1: 0.7062017090972422 

As can be seen, the accuracy and F1 metrics are quite different between the two methods. The difference in metrics is even larger when I use KNeighborsClassifier.


Solution

  • TL;DR: The two ways of calculation are not equivalent due to the different way you handle the TF-IDF transformation; the first calculation is the correct one.


    In the first calculation you correctly apply fit_transform only to the training data of each fold, and transform to the validation data fold:

    X_tr_vec_tfidf = tfidf.fit_transform(X_tr) # vectorize the training folds
    X_val_vec_tfidf = tfidf.transform(X_val) # vectorize the validation fold    
    

    But in the second calculation you do not do that; instead, you apply fit_transform to the whole of the training data, before it is split into training and validation folds:

    X_tr_vec_tfidf = tfidf.fit_transform(X_train)
    

    hence the difference. Any score obtained with the second, wrong way of calculation is not trustworthy, because of information leakage: your validation data are not actually unseen, since they have participated in fitting the TF-IDF transformation.
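
    To see the leakage concretely, here is a minimal sketch with made-up documents: a TfidfVectorizer fitted on the whole training set learns a vocabulary (and IDF weights) that already covers terms appearing only in the would-be validation fold, while one fitted on the training fold alone does not:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy documents, purely to illustrate the effect
    docs = ["spam offer now", "meeting at noon", "free spam prize", "lunch at noon"]
    train_fold, val_fold = docs[:3], docs[3:]

    # wrong: fit on all documents, including the future validation fold
    tfidf_all = TfidfVectorizer().fit(docs)
    # right: fit only on the training fold
    tfidf_fold = TfidfVectorizer().fit(train_fold)

    print(sorted(tfidf_all.vocabulary_))   # contains 'lunch', which occurs only in the validation fold
    print(sorted(tfidf_fold.vocabulary_))  # does not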


    The correct way to use cross_val_score when we have transformations is via a Pipeline (see the Pipeline API reference and the User Guide):

    from sklearn.pipeline import Pipeline
    
    tfidf = TfidfVectorizer()
    nb = MultinomialNB(alpha=0.5, fit_prior=False) 
    
    # the vectorizer is now fitted inside each fold, on that fold's training part only
    pipeline = Pipeline([('transformer', tfidf), ('estimator', nb)])

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train, cv=skf)
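
    To get both accuracy and F1 from this pipeline in one pass (mirroring the loop in the question), a sketch using cross_validate with a list of scorers:

    from sklearn.model_selection import cross_validate

    # evaluate the leakage-free pipeline with both metrics in a single call
    cv_results = cross_validate(pipeline, X_train, y_train, cv=skf,
                                scoring=['accuracy', 'f1'])
    print(f"accuracy: {cv_results['test_accuracy'].mean()}")
    print(f"f1: {cv_results['test_f1'].mean()}")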