Tags: python-3.x, pandas, scikit-learn, nlp, text-classification

Sklearn train_test_split: how to split a dataset so predicted labels can be compared with ground truth labels


I am trying to perform multi-class text classification using an SVM on a small dataset, adapting this guide. The input CSV contains a 'text' column and a 'label' column (the labels were manually assigned for this specific task).

One label needs to be assigned to each text entry. Using LinearSVC with TfidfVectorizer I obtained an accuracy of 75%, which seems higher than expected for a very small dataset of only 400 samples. To raise the accuracy further, I wanted to look at the entries that were not classified correctly, but here I ran into an issue. I used train_test_split like this:

Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, random_state = 1004)

I don't know which text entries train_test_split assigned to the test set (as far as I understand, the function randomly picks the 10% of entries specified by test_size). So I don't know which subset of the original corpus labels I should compare the predicted test labels against. In other words, is there a way to force a specific subset into the test set, e.g. the last 40 of the 400 entries in the dataset?

This would help to manually compare the predicted labels vs the ground truth labels.

Below is the code:

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np
import os


class Config:

    # Data and output directory config
    data_path = r'./take3/Data'
    code_train = r'q27.csv'



if __name__ == "__main__":


    print('--------Code classification--------\n')

    Corpus = pd.read_csv(os.path.join(Config.data_path, Config.code_train), sep = ',', encoding='cp1252', usecols=['text', 'label'])

    # Replace missing values (read by pandas as float NaN) with empty strings
    train_text = ['' if isinstance(t, float) else t for t in Corpus['text'].values]


    # TODO: fine-tuning
    tfidf = TfidfVectorizer(
        sublinear_tf=True,
        min_df=3, norm='l2',
        encoding='latin-1',
        ngram_range=(1, 2),
        stop_words='english')

    X = tfidf.fit_transform(train_text)             # Learn vocabulary and idf, return document-term matrix.

    # print('Array mapping from feature integer indices to feature name', tfidf.get_feature_names_out())
    print('X.shape:', X.shape)

    y = np.array(list(Corpus['label']))
    print('The corpus original labels:',y)
    print('y.shape:', y.shape)

    Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, random_state = 1004)


    model = LinearSVC(random_state=1004)
    model.fit(Train_X, Train_Y)

    SVM_predict_test = model.predict(Test_X)
    accuracy = accuracy_score(Test_Y, SVM_predict_test, normalize=True, sample_weight=None)*100
    print('Predicted labels for the test dataset', SVM_predict_test)
    print("SVM accuracy score: {:.4f}".format(accuracy))

And this is the output:


                         
--------Code classification--------

X.shape: (400, 136)
The corpus original labels: [15 20  9 14 98 12  3  4  4 22 99  3 98 20 99  1 10 20  8 15 98 12 18  7
 20 99  8  8 13  2  8  6 22  4 98  5 98 12 18  8 98 18 24  4  3 19 12  5
 20  6  8 15  5 14 19 22 16 10 24 16 98  8  8 16  2 20  4  8 20  6 22 98
  3 98 15 12  2 13  5  8  8  1 10 16 20 12  7 20 98 22 99 10 12  8  8 16
 16  4  4 99 20  8 16  2 12 15 16 10  5 22  8  7  7  4  5 12 16 14  1 10
 22 20  4  4  5 99 16  3  5 22 99  5  3  4  4  3  6 99  8 20  2 10 98  6
  6  8 99  3  8 99  2  5 15  6  6  7  8 14  9  4 20  3 99  5 98 15  5  5
 20 10  4 99 99 16 22  8 10 22 98 12  3  5  9 99 14  8  9 18 20 14 15 20
 20  1  6 23 22 20  6  1 18  8 12 10 15 10  6 10  3  4  8 24 14 22  5  3
 22 24 98 98 98  4 15 19  5  8  1 17 16  6 22 19  4  8  2 15 12 99 16  8
  9  1  8 22 14  5 20  2 10 10 22 12 98  3 19  5 98 14 19 22 18 16 98 16
  6  4 24 98 24 98 15  1  3 99  5 10 22  4 16 98 22  1  8  4 20  8  8  5
 20  4  3 20 22  4 20 12  7 21  5  4 16  8 22 20 99  5  6 99  8  3  4 99
  6  8 12  3 10  4  8  5 14 20  6 99  4  4  6  4 98 21  1 23 20 98 19  6
  4 22 98 98 20 10  8 10 19 16 14 98 14 12 10  4 22 14  3 98 10 20 98 10
  9  7  3  8  3  6  6 98  8 99  1 20 18  8  2  6 99 99 99 14 14 16 20 99
  1 98 23  6 12  4  1  3 99 99  3 22  5  7 16 99]
y.shape: (400,)
Predicted labels for the test dataset [ 1  8  5  4 15 10 14 12  6  8  8 16 98 20  7 99 99 12 99 24  4 98 99  3
 20  3  6 14 18 98 99 22  4 99  4 10 14  4  3 98]
SVM accuracy score: 75.0000

Solution

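  • By default, train_test_split shuffles the data before splitting it into train and test subsets. To get a fixed split, set shuffle=False and drop random_state (it has no effect without shuffling). With test_size=0.1 the test set is then the last 10% of the rows, i.e. the last 40 of the 400 entries.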

    Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, shuffle=False)
    

    See How to get a non-shuffled train_test_split in sklearn
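
If you would rather keep the random split, one common alternative is to pass an array of row indices through the same train_test_split call, so you can look up which rows ended up in the test set. The sketch below is only an illustration under assumptions: the `corpus` DataFrame, the placeholder feature matrix `X`, and the fake `predicted` array are hypothetical stand-ins for the CSV data, the TF-IDF matrix, and the output of model.predict(Test_X) from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 400-row corpus read from the CSV.
corpus = pd.DataFrame({
    "text": [f"entry {i}" for i in range(400)],
    "label": np.arange(400) % 5,
})
X = np.arange(400).reshape(-1, 1)   # placeholder features (TF-IDF matrix in the question)
y = corpus["label"].to_numpy()
indices = np.arange(len(corpus))    # one position per row of the corpus

# Option 1: no shuffling, so the test set is simply the last 10% of the rows.
_, _, _, _, idx_train_fixed, idx_test_fixed = train_test_split(
    X, y, indices, test_size=0.1, shuffle=False)

# Option 2: keep the random split, but split the row indices alongside X and y.
Train_X, Test_X, Train_Y, Test_Y, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.1, random_state=1004)

# Placeholder predictions; in the question these come from model.predict(Test_X).
predicted = Test_Y.copy()
predicted[:3] = -1                  # pretend the first three test rows were misclassified

# Line up the original text, the ground truth label, and the prediction.
comparison = corpus.iloc[idx_test].assign(predicted=predicted)
misclassified = comparison[comparison["label"] != comparison["predicted"]]
print(misclassified)
```

Here idx_test holds the positions of the test rows in the original corpus, so corpus.iloc[idx_test] returns the matching text entries in the same order as Test_Y and the predictions.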