Randomly splitting 1 file from many files based on ID


In my dataset, I have a large number of images in JPG format, named [ID]_[Cam]_[Frame].jpg. The dataset contains many IDs, and every ID has a different number of images. I want to randomly move one image from each ID into a separate set of images. The problem is that the IDs in the dataset are not always consecutive (some numbers are skipped). In the example below, the set of files has no IDs 2 and 3.

Is there any Python code to do this?


Before

  • TrainSet
    • 00000000_0001_00000000.jpg
    • 00000000_0001_00000001.jpg
    • 00000000_0002_00000001.jpg
    • 00000001_0001_00000001.jpg
    • 00000001_0002_00000001.jpg
    • 00000001_0002_00000002.jpg
    • 00000004_0001_00000001.jpg
    • 00000004_0002_00000001.jpg

After

  • TrainSet

    • 00000000_0001_00000000.jpg
    • 00000000_0001_00000002.jpg
    • 00000001_0002_00000001.jpg
    • 00000001_0001_00000001.jpg
    • 00000004_0001_00000001.jpg
  • ValidationSet

    • 00000000_0001_00000001.jpg
    • 00000001_0001_00000002.jpg
    • 00000004_0001_00000002.jpg

Solution

  • In this case, I would use a dictionary with the ID as the key and the list of file names with that ID as the value, then randomly pick one file from each list.

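    import os
    import shutil
    from pathlib import Path
    from random import choice

    source_folder = "SOURCE_FOLDER"
    dest_folder = "DEST_FOLDER"

    # Group file names by ID, the part before the first underscore.
    ids = {}
    for f in os.listdir(source_folder):
        if not f.lower().endswith(".jpg"):
            continue  # ignore subfolders and anything that is not a jpg image
        f_id = f.split("_")[0]
        ids.setdefault(f_id, []).append(f)

    # Create the destination folder if it does not exist yet.
    Path(dest_folder).mkdir(parents=True, exist_ok=True)

    # Move one randomly chosen file per ID to the destination folder.
    for files in ids.values():
        random_file = choice(files)
        shutil.move(
            os.path.join(source_folder, random_file),
            os.path.join(dest_folder, random_file),
        )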
    
    

    In your case, replace SOURCE_FOLDER with TrainSet and DEST_FOLDER with ValidationSet.
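
If you want the split to be repeatable, or want to keep IDs that have only a single image in the training set, the same idea can be wrapped in a few small functions. This is only a minimal sketch: the function names (`group_by_id`, `pick_one_per_id`, `split_validation`) and the `seed` and `min_files` parameters are hypothetical names chosen for illustration, and it assumes every file name starts with the ID followed by an underscore.

```python
import random
import shutil
from collections import defaultdict
from pathlib import Path


def group_by_id(filenames):
    """Map each ID (the text before the first underscore) to its file names."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.split("_")[0]].append(name)
    return dict(groups)


def pick_one_per_id(groups, rng, min_files=2):
    """Pick one file per ID, skipping IDs with fewer than min_files images.

    min_files is an assumption of this sketch: with the default of 2, an ID
    that has a single image keeps it in the source folder.
    """
    return {
        file_id: rng.choice(sorted(names))  # sorted so a fixed seed is repeatable
        for file_id, names in groups.items()
        if len(names) >= min_files
    }


def split_validation(source_folder, dest_folder, seed=None):
    """Move one random .jpg per ID from source_folder to dest_folder."""
    source, dest = Path(source_folder), Path(dest_folder)
    dest.mkdir(parents=True, exist_ok=True)
    names = [p.name for p in source.glob("*.jpg")]
    picked = pick_one_per_id(group_by_id(names), random.Random(seed))
    for name in picked.values():
        shutil.move(str(source / name), str(dest / name))
    return picked  # {ID: moved file name}
```

Calling `split_validation("TrainSet", "ValidationSet", seed=0)` would then move one image per ID and return which file was chosen for each. Because the IDs are taken from the file names themselves, gaps in the numbering (such as the missing IDs 2 and 3) need no special handling.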