shutil.copy doesn't seem to copy all the expected files but doesn't throw an exception


I'm writing a simple function to split a Kaggle binary image classification dataset (using a native Kaggle notebook) into training, validation and test sets. For each of the two classes (whose images are stored in separate folders), I shuffle the images and then make the splits. Finally, I try to copy each image into the appropriate sub-folder of my working folder.

Here's my code to do that:

import os
from shutil import copy, rmtree

import numpy as np

def shuffle_split_data(source_path, dest_path, train_split, val_split, test_split, clear_existing_destination=True):
    # remove previous files in working folder (if any)
    if clear_existing_destination:
        for item in os.listdir(dest_path):
            item_path = os.path.join(dest_path, item)
            if os.path.isfile(item_path):
                os.remove(item_path)
            elif os.path.isdir(item_path):
                rmtree(item_path)

    for label in os.listdir(source_path):
        train_path = os.path.join(dest_path, "training", label)
        val_path = os.path.join(dest_path, "validation", label)
        test_path = os.path.join(dest_path, "test", label)
        
        os.makedirs(train_path, exist_ok=True)
        os.makedirs(val_path, exist_ok=True)
        os.makedirs(test_path, exist_ok=True)
        
        label_source_path = os.path.join(source_path, label)
        examples_path = os.listdir(label_source_path)
        examples_path = np.random.choice(examples_path, len(examples_path)) # shuffle
        total_examples = len(examples_path)
        
        n_train_examples = int(train_split * total_examples)
        n_val_examples = int(val_split * total_examples)
        # n_test_examples = int(test_split * total_examples)
        
        train_examples = examples_path[:n_train_examples]
        val_examples = examples_path[n_train_examples:(n_train_examples+n_val_examples+1)]
        test_examples = examples_path[(n_train_examples+n_val_examples+1):]
        
        print(len(train_examples), len(val_examples), len(test_examples))
        
        for file in train_examples:
            source = os.path.join(label_source_path, file)
            dest = os.path.join(train_path, file)
            
            copy(source, dest)
        
        for file in val_examples:
            source = os.path.join(label_source_path, file)
            dest = os.path.join(val_path, file[:-4] + "_" + label + ".jpg")
            
            copy(source, dest)
            
        for file in test_examples:
            source = os.path.join(label_source_path, file)
            dest = os.path.join(test_path, file[:-4] + "_" + label + ".jpg")
            
            copy(source, dest)
        
shuffle_split_data(
    '/kaggle/input/monkeypox-skin-lesion-dataset/Original Images/Original Images',
    '/kaggle/working/',
    train_split,
    val_split,
    test_split,
    clear_existing_destination=True
)

The output to console is simply:

100 13 13
81 11 10

which are the numbers of images I expect the training, validation and test sets to have for each of the labels/classes.

I'm assuming that shutil.copy isn't copying all the images properly, because when I examine my training folder for the first class (the one that should have 100 images for training) using the following code:

print(len(os.listdir('/kaggle/working/training/Others/')))

The output is 70. For some reason, this number changes every time that I run the function above.

What am I missing? I assume the error must be really dumb, but I've been debugging and trying to locate it for the past few hours and haven't been able to make any progress. Thank you in advance!


Solution

  • The problem here is the way that np.random.choice is called.

    With the default parameter replace=True, "sampling with replacement" is performed, i.e. "a value of a can be selected multiple times" (documentation).

    So after the line:

    examples_path = np.random.choice(examples_path, len(examples_path))
    

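    the array examples_path contains the same number of filenames it did before, but some filenames will be duplicated and others will be missing. Each duplicate is copied to the same destination path and silently overwrites the earlier copy, which is why the folder ends up with fewer files than the printed count, and why that number changes from run to run.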

    One fix is to supply the parameter replace=False:

    examples_path = np.random.choice(examples_path, len(examples_path), replace=False)
    

    But it's probably clearer and more concise to switch to the function np.random.shuffle instead:

    np.random.shuffle(examples_path)
    

    This shuffles the list in place (i.e. it returns None and modifies the existing examples_path, so the result should not be assigned back to examples_path) and is intended to do exactly what you need here.
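
To see the effect without touching the file system, here is a minimal sketch that compares the three approaches on a made-up list of filenames. The names (`filenames`, `n_train`) and the count of 126 files are assumptions chosen to resemble the first class in the question, not values taken from the real dataset, and the seed is only there to make the demo repeatable.

```python
import numpy as np

# Hypothetical filenames standing in for os.listdir(label_source_path).
filenames = [f"img_{i:03d}.jpg" for i in range(126)]
n_train = 100  # assumed size of the training slice

np.random.seed(0)  # only to make this demo repeatable

# Default behaviour (replace=True): same length, but with repeated names.
with_replacement = np.random.choice(filenames, len(filenames))

# replace=False: a true permutation, every name appears exactly once.
without_replacement = np.random.choice(filenames, len(filenames), replace=False)

# np.random.shuffle works in place and returns None.
shuffled = list(filenames)
shuffle_result = np.random.shuffle(shuffled)

# Copying to a folder keeps one file per distinct name, so count distinct names.
unique_train_with = len(set(with_replacement[:n_train]))
unique_train_without = len(set(without_replacement[:n_train]))
unique_train_shuffled = len(set(shuffled[:n_train]))

print("with replacement:   ", unique_train_with)
print("without replacement:", unique_train_without)
print("in-place shuffle:   ", unique_train_shuffled)
```

Under these assumptions, the first count comes out well below 100 (around 69 on average for 100 draws from 126 names), which is consistent with the 70 files seen in the question, while the other two are exactly 100.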