Extract details from multiple image filenames in Python and add them as labels for a dataset

I have a folder containing 1300 .JPEG files, all of which have filenames that follow a specific pattern.

Each filename has the form category_count_randomString.JPEG. For example, here is one image from the folder:

13_2_5jdf.JPEG where 13 is the category, 2 is the count of that category in the image, followed by the random string.

I'd like to be able to:

extract the category from each filename and assign it as a label (to then build a CNN model), and
extract the count of that category from each filename and store it in a vector/array.

For now, I've just loaded the images (not yet as an array) using the glob function.

import glob

data = '/Users/Data'

images = glob.glob(data+'/*.JPEG')

I'm new to coding, so I'm looking for 'idiot-proof' lines of code that I can just drop into my notebook to make this work.

Solution

You can use os.listdir to list the files in your data directory and the string split method to pull the individual fields out of each filename:

import os

data_path = "/Users/Data"

categories = []
counts = []
rand_strs = []

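for img_filename in os.listdir(data_path):
    # Skip anything in the folder that is not a .JPEG image
    if img_filename.endswith(".JPEG"):
        # '13_2_5jdf.JPEG' -> '13_2_5jdf' -> ['13', '2', '5jdf']
        category, count, rand_str = img_filename.split('.')[0].split('_')
        categories.append(category)   # kept as a string, e.g. '13'
        counts.append(int(count))     # converted to an integer, e.g. 2
        rand_strs.append(rand_str)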
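
Since the question loads the images with glob, it may be safer to parse the labels from that same list of paths, so that label i always belongs to image i (neither os.listdir nor glob guarantees a particular order). Below is a minimal sketch of that idea. The function names parse_filename and build_labels are made up for illustration, the example paths are placeholders, and converting the category to an integer assumes every category is numeric, as in the 13_2_5jdf.JPEG example.

```python
import os


def parse_filename(path):
    """Split 'category_count_randomString.JPEG' into (category, count, rand_str)."""
    # Drop the directory and the extension:
    # '/Users/Data/13_2_5jdf.JPEG' -> '13_2_5jdf'
    stem = os.path.splitext(os.path.basename(path))[0]
    # maxsplit=2 keeps any extra underscores inside the random string
    category, count, rand_str = stem.split("_", 2)
    # Assumption: categories are always numeric, so int() is safe here
    return int(category), int(count), rand_str


def build_labels(paths):
    """Return (categories, counts) lists in the same order as `paths`."""
    label_list, count_list = [], []
    for path in paths:
        category, count, _ = parse_filename(path)
        label_list.append(category)
        count_list.append(count)
    return label_list, count_list


# Placeholder paths; in the notebook this would be the list returned by glob.glob
example_paths = ["/Users/Data/13_2_5jdf.JPEG", "/Users/Data/7_1_ab_c.JPEG"]
label_list, count_list = build_labels(example_paths)
print(label_list, count_list)
```

If NumPy arrays are needed for the model, passing each list to numpy.asarray should be enough.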

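The three lists share the same index, so position i in each one describes the same image. For example, to look up the count for the first image whose category is 13 (list.index returns only the first match, and the categories are stored as strings, hence '13'), you can do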

category_idx = categories.index('13')
print("Category %s has %d elements" % (categories[category_idx], counts[category_idx]))
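
Because list.index returns only the first matching position, the lookup above reports the count for a single image. If the goal is the total across every image of a category, one possible approach is to sum the counts per category. This is only a sketch: the helper name total_per_category is hypothetical, and the sample lists are invented stand-ins for the categories and counts lists built by the loop.

```python
from collections import defaultdict


def total_per_category(categories, counts):
    """Sum the per-image counts for each category."""
    totals = defaultdict(int)
    # zip pairs each category with the count from the same image
    for category, count in zip(categories, counts):
        totals[category] += count
    return dict(totals)


# Invented sample data standing in for the lists built from the filenames
sample_categories = ["13", "7", "13"]
sample_counts = [2, 1, 4]

totals = total_per_category(sample_categories, sample_counts)
print("Category %s has %d elements in total" % ("13", totals["13"]))
```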