machine-learning · scikit-learn · text-classification

How can I split documents into training set and test set?


I am trying to build a classification model. I have 1000 text documents in a local folder. I want to divide them into a training set and a test set with a split ratio of 70:30 (70 → training and 30 → test). What is the best approach to do so? I am using Python.


I want a programmatic approach to splitting the training and test sets: first, read the files in the local directory; second, build a list of those files and shuffle them; third, split them into a training set and a test set.

I tried a few ways using only built-in Python keywords and functions, without success. Eventually I worked out how to approach it. Cross-validation is also a good option to consider when building general classification models.


Solution

  • There will be a few steps:

    1. Get a list of the files
    2. Randomize the files
    3. Split files into training and testing sets
    4. Do the thing

    1. Get a list of the files

    Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. We want to get a list of all of these files. This is done simply with the os module. I'm assuming you don't have any subdirectories; this would change if there were.

    import os
    
    def get_file_list_from_dir(datadir):
        all_files = os.listdir(os.path.abspath(datadir))
        data_files = list(filter(lambda file: file.endswith('.data'), all_files))
        return data_files
    

    So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (roughly what the shell glob /ml/data/*.data matches, except that we get bare file names rather than full paths).
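    As an aside, the standard library's glob module expresses that same pattern directly. A quick equivalent sketch (note that, unlike the listdir version, it returns full paths rather than bare file names):

    import glob
    import os

    def get_file_list_from_dir_glob(datadir):
        # Hypothetical variant of the helper above: glob does the pattern
        # matching for us and returns full paths, not bare file names.
        return glob.glob(os.path.join(os.path.abspath(datadir), '*.data'))
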

    2. Randomize the files

    We don't want the split to depend on whatever order the files happen to be listed in: if that order correlates with the labels (alphabetical names, dates, and so on), a straight slice would give a biased training set.

    from random import shuffle
    
    def randomize_files(file_list):
        shuffle(file_list)
    

    Note that random.shuffle shuffles in place, so it modifies the existing list and returns None. (Of course this function is rather silly, since you could just call shuffle directly; you could fold it into a larger function to make the flow read more naturally.)
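
    If you want the shuffle (and therefore the split) to be reproducible between runs, you can seed the random number generator. A small sketch; the seed value here is arbitrary:

    import random

    def randomize_files(file_list, seed=42):
        # A seeded Random instance produces the same shuffle on every run,
        # which makes experiments repeatable.
        random.Random(seed).shuffle(file_list)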

    3. Split files into training and testing sets

    I'll assume a 70:30 ratio instead of any specific number of documents. So:

    from math import floor
    
    def get_training_and_testing_sets(file_list):
        split = 0.7
        split_index = floor(len(file_list) * split)
        training = file_list[:split_index]
        testing = file_list[split_index:]
        return training, testing
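
    Since the question is tagged scikit-learn, it's worth noting that sklearn.model_selection.train_test_split handles the shuffling and the splitting in one call. A quick sketch, assuming scikit-learn is installed and file_list is the list from step 1:

    from sklearn.model_selection import train_test_split

    # test_size=0.3 gives the same 70:30 split; random_state makes it repeatable.
    training, testing = train_test_split(file_list, test_size=0.3, random_state=42)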
    

    4. Do the thing

    This is the step where you open each file and do your training and testing. I'll leave this to you!
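
    If it helps, here is a minimal sketch of the reading part, assuming plain UTF-8 text files; read_documents is a hypothetical helper, not part of the recipe above. One subtlety: get_file_list_from_dir returns bare file names, so they need to be re-joined with the directory before opening:

    import os

    def read_documents(datadir, file_list):
        documents = []
        for name in file_list:
            # encoding='utf-8' is an assumption; adjust to your files.
            with open(os.path.join(datadir, name), encoding='utf-8') as f:
                documents.append(f.read())
        return documents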


    Cross-Validation

    Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that every document gets used for both training and testing. You can customize how many documents are used for training in each "fold". I can go into more depth on this if you like.

    All right, since you requested it, I will explain this a little bit more.

    So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing — just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.

    Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times, each time holding out a different tenth of the data for testing and training on the other nine tenths. The first fold will test on documents 1-100 and train on 101-1000. The second fold will test on 101-200 and train on 1-100 and 201-1000. And so on.

    If we did, say, a 40-fold CV system, each fold would hold out 1000/40 = 25 documents: the first fold would test on documents 1-25 and train on 26-1000, the second fold would test on 26-50 and train on 1-25 and 51-1000, and so on.

    To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator — a function which we can iterate through like a list.

    def cross_validate(data_files, folds):
        if len(data_files) % folds != 0:
            raise ValueError(
                "invalid number of folds ({}) for the number of "
                "documents ({})".format(folds, len(data_files))
            )
        fold_size = len(data_files) // folds
        for split_index in range(0, len(data_files), fold_size):
            # Hold out one fold for testing and train on everything else.
            testing = data_files[split_index:split_index + fold_size]
            training = data_files[:split_index] + data_files[split_index + fold_size:]
            yield training, testing
    

    That yield keyword at the end is what makes this a generator. You would use it like so:

    def ml_function(datadir, num_folds):
        data_files = get_file_list_from_dir(datadir)
        randomize_files(data_files)
        for train_set, test_set in cross_validate(data_files, num_folds):
            do_ml_training(train_set)
            do_ml_testing(test_set)
    

    Again, it's up to you to implement the actual functionality of your ML system.
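
    One last scikit-learn note: sklearn.model_selection.KFold implements this same scheme for you, yielding index arrays rather than the items themselves. A quick sketch, assuming scikit-learn is available:

    from sklearn.model_selection import KFold

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(data_files):
        # KFold yields index arrays; map them back onto the file list.
        train_set = [data_files[i] for i in train_idx]
        test_set = [data_files[i] for i in test_idx]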