
Extract details in multiple image filenames in Python and add them as labels for a dataset


I have a folder containing 1300 .JPEG files all of which have filenames in a specific order.

The order of each file name is category_count_randomString.JPEG. To give an example, below is one image from the folder:

13_2_5jdf.JPEG where 13 is the category, 2 is the count of that category in the image, followed by the random string.

I'd like to be able to:

  1. extract the category from each filename and assign it as a label (to then build a CNN model), and
  2. extract the count of that category from each filename and store it in a vector/array.

For now, I've just loaded the images (not yet as an array) using the glob function.

    import glob

    data = '/Users/Data'

    images = glob.glob(data+'/*.JPEG')

I'm new to coding, so I'm looking for 'idiot-proof' lines of code that I can drop into my notebook to make this work.


Solution

  • You can use os.listdir to list the files in your data directory and str.split to pull the information out of each filename:

    import os
    
    data_path = "/Users/Data"
    
    categories = []
    counts = []
    rand_strs = []
    
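    for img_filename in os.listdir(data_path):
        if img_filename.endswith(".JPEG"):
            # Drop the extension, then split "category_count_randomString" on underscores.
            # maxsplit=2 keeps the random string intact even if it contains an underscore.
            stem = os.path.splitext(img_filename)[0]
            category, count, rand_str = stem.split('_', 2)
            categories.append(category)  # kept as a string, e.g. '13'
            counts.append(int(count))
            rand_strs.append(rand_str)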
    

    The lists share the same indexing, so entry i of each list describes the same file. For example, to look up the count for the first file in category 13 (list.index only returns the first match), you can do

    category_idx = categories.index('13')
    # Python 3: print is a function
    print("Category %s has %d elements" % (categories[category_idx], counts[category_idx]))
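
If you already have full paths from glob, as in the question, a minimal sketch along the following lines keeps the labels aligned with the image list and sums the counts per category. The helper name parse_filename and the second and third paths are made up for illustration; only 13_2_5jdf.JPEG comes from the question. It assumes every filename follows the category_count_randomString pattern with integer category and count.

```python
import os
from collections import defaultdict


def parse_filename(path):
    """Return (category, count) as ints from a 'category_count_randomString.JPEG' path."""
    # Strip the directory and the extension, leaving e.g. "13_2_5jdf".
    stem = os.path.splitext(os.path.basename(path))[0]
    # maxsplit=2 so an underscore inside the random string does not break unpacking.
    category, count, _rand_str = stem.split("_", 2)
    return int(category), int(count)


# Stand-in for: images = glob.glob(data + '/*.JPEG')
# Only the first name is from the question; the others are hypothetical.
images = [
    "/Users/Data/13_2_5jdf.JPEG",
    "/Users/Data/13_1_x9qz.JPEG",
    "/Users/Data/7_4_ab3k.JPEG",
]

labels = []  # labels[i] is the category of images[i]
counts = []  # counts[i] is the count for images[i]
for path in images:
    category, count = parse_filename(path)
    labels.append(category)
    counts.append(count)

# Total count per category across all files.
totals = defaultdict(int)
for category, count in zip(labels, counts):
    totals[category] += count
```

Because labels and counts are built in the same loop as the paths are read, position i in each list always refers to images[i], which is what you need when you later load the images into an array for training.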