python tensorflow keras image-recognition image-classification

Load image data for image classification from different subfolders

I have training image data and a CSV file that contains the labels of the images. My data directory looks like this:

Train data/
...1/
......1_1.jpg
......1_2.jpg
......1_3.jpg
...2/
......2_1.jpg
......2_2.jpg
......2_3.jpg
etc.

So each subfolder contains 3 different images of the same person, and all of them share the same label. My CSV file has this format:

subfolder,labels
1,0
2,1
3,0
etc.

I know that tf.keras.preprocessing.image has ImageDataGenerator, which can read from a dataframe, but the format it needs doesn't match my directory layout. Any clue on how to load my images efficiently to train my model? Thanks in advance.

Solution

I think this may do what you want. I created a directory called new_people containing 7 subdirectories named 1 through 7, and placed 3 image files in each of them. In the code below I first create a data frame df in the form you described for your csv file. The code then builds a data frame data_df with columns filepaths and labels. The filepaths column holds the full path to each image file and the labels column holds the label associated with that image. I tested the code and it seems to work. The code is shown below.

import os
import pandas as pd
folder=[1,2,3,4,5,6,7] # this is a list of the folders
labels=[2,3,1,0,6,4,5] # this is a list of the labels associated with each folder
Fseries=pd.Series(folder, name='folder')
Lseries=pd.Series(labels, name='labels')
df=pd.concat([Fseries, Lseries], axis=1) # this is the data frame that should be like your csv file
print(df.head(7))

The printout is:

   folder  labels
0       1       2
1       2       3
2       3       1
3       4       0
4       5       6
5       6       4
6       7       5
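
If the labels already sit in a csv file like the one in the question, df probably doesn't need to be typed in by hand. Here is a minimal sketch, assuming the column names subfolder and labels shown in the question. The inline csv_text is only a stand-in for the real file, and df_from_csv is just an illustrative name.

```python
import io

import pandas as pd

# Stand-in for the real csv file; in practice pass its path to pd.read_csv instead.
csv_text = "subfolder,labels\n1,0\n2,1\n3,0\n"

# Rename 'subfolder' to 'folder' so the frame has the same columns as df above.
df_from_csv = pd.read_csv(io.StringIO(csv_text)).rename(columns={"subfolder": "folder"})
print(df_from_csv)
```

A frame loaded this way could be used in place of df in the code that follows.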

The rest of the code is below:

sdir=r'c:\temp\new_people' # main directory where class sub directories are present
filepaths=[]
labels=[]
class_list=os.listdir(sdir) # list of class sub directories
for klass in class_list: # iterate over the class subdirectories
    class_path=os.path.join(sdir,klass) # path to class sub directory
    if not os.path.isdir(class_path): # skip any stray files in the main directory
        continue
    for i in range(len(df)): # iterate through the data set
        if str(df['folder'].iloc[i]) == klass: # convert folder name to a string and compare to current klass
            label=df['labels'].iloc[i] # get the associated label
            flist=os.listdir(class_path) # get a list of all the files in the klass sub directory
            for f in flist: # iterate through the list of files
                fpath=os.path.join(class_path,f) # get the full path to the file
                filepaths.append(fpath) # append the full file path
                labels.append(str(label)) # append the label as a string
Fseries=pd.Series(filepaths, name='filepaths')
Lseries=pd.Series(labels, name='labels')
data_df=pd.concat([Fseries, Lseries], axis=1) # create data frame with columns filepaths, labels
print(data_df.head(28))
# Now data_df can be partitioned into a train_df, a test_df and a valid_df using train_test_split

The printout of the resulting data_df data frame is:

                        filepaths  labels
0   c:\temp\new_people\1\0001.jpg       2
1   c:\temp\new_people\1\0002.jpg       2
2   c:\temp\new_people\1\0003.jpg       2
3   c:\temp\new_people\2\0004.jpg       3
4   c:\temp\new_people\2\0005.jpg       3
5   c:\temp\new_people\2\0006.jpg       3
6   c:\temp\new_people\3\0007.jpg       1
7   c:\temp\new_people\3\0008.jpg       1
8   c:\temp\new_people\3\0009.jpg       1
9   c:\temp\new_people\4\0010.jpg       0
10  c:\temp\new_people\4\0011.jpg       0
11  c:\temp\new_people\4\0012.jpg       0
12  c:\temp\new_people\5\0013.jpg       6
13  c:\temp\new_people\5\0014.jpg       6
14  c:\temp\new_people\5\0015.jpg       6
15  c:\temp\new_people\6\0016.jpg       4
16  c:\temp\new_people\6\0017.jpg       4
17  c:\temp\new_people\6\0018.jpg       4
18  c:\temp\new_people\7\0019.jpg       5
19  c:\temp\new_people\7\0020.jpg       5
20  c:\temp\new_people\7\0021.jpg       5
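
The nested loop above rescans df for every subdirectory. With many folders, a dictionary lookup may be quicker. The sketch below is one possible variant using only the standard library, not the code I tested above. It assumes the csv header subfolder,labels from the question. The helper name collect_labeled_paths and the temporary demo layout are made up for illustration.

```python
import csv
import os
import tempfile


def collect_labeled_paths(image_root, csv_path):
    """Return (filepath, label) pairs for every file in each labeled subfolder."""
    # Map each subfolder name to its label; both stay strings, as read from the csv.
    with open(csv_path, newline="") as handle:
        label_of = {row["subfolder"]: row["labels"] for row in csv.DictReader(handle)}

    pairs = []
    for folder in sorted(os.listdir(image_root)):
        folder_path = os.path.join(image_root, folder)
        # Skip stray files and subfolders that the csv does not mention.
        if not os.path.isdir(folder_path) or folder not in label_of:
            continue
        for name in sorted(os.listdir(folder_path)):
            pairs.append((os.path.join(folder_path, name), label_of[folder]))
    return pairs


# Throwaway layout mirroring the question: two subfolders with three files each.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "Train data")
    for folder in ("1", "2"):
        os.makedirs(os.path.join(root, folder))
        for i in (1, 2, 3):
            open(os.path.join(root, folder, f"{folder}_{i}.jpg"), "w").close()
    csv_file = os.path.join(tmp, "labels.csv")
    with open(csv_file, "w", newline="") as handle:
        handle.write("subfolder,labels\n1,0\n2,1\n")

    pairs = collect_labeled_paths(root, csv_file)
    for path, label in pairs:
        print(os.path.relpath(path, root), label)
```

Passing the result to pd.DataFrame(pairs, columns=['filepaths', 'labels']) should give a frame with the same columns as data_df.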

The data frame appears to correctly reflect the folder labels in the df data frame. data_df can now be used with train_test_split to create a train_df, a test_df and a valid_df. These can then be used with ImageDataGenerator.flow_from_dataframe to create a train_generator, a test_generator and a valid_generator for use with model.fit, model.evaluate or model.predict. If you need help on how to do that let me know.
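
As a rough sketch of that last step: the split below uses pandas sample and drop as a stand-in for train_test_split, so that only pandas is needed to try it. The 80/10/10 proportions, the seed, the image size and the helper names split_frame and make_generator are all assumptions for illustration. ImageDataGenerator is marked as deprecated in newer TensorFlow releases, so treat the generator part as a sketch for versions that still ship it.

```python
import pandas as pd


def split_frame(data_df, train_frac=0.8, seed=123):
    """Shuffle data_df and split it into train, validation and test frames."""
    train_df = data_df.sample(frac=train_frac, random_state=seed)
    rest = data_df.drop(train_df.index)  # rows not chosen for training
    valid_df = rest.sample(frac=0.5, random_state=seed)
    test_df = rest.drop(valid_df.index)
    return train_df, valid_df, test_df


def make_generator(frame, shuffle, img_size=(224, 224), batch_size=32):
    """Wrap one frame in a Keras generator (TensorFlow is imported only when called)."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    return ImageDataGenerator(rescale=1.0 / 255).flow_from_dataframe(
        frame,
        x_col="filepaths",         # column holding the full image paths
        y_col="labels",            # column holding the string labels
        target_size=img_size,
        class_mode="categorical",  # expects string labels, hence str(label) earlier
        batch_size=batch_size,
        shuffle=shuffle,
    )


# Stand-in frame with the same columns as data_df; the paths are placeholders.
demo_df = pd.DataFrame({
    "filepaths": [f"img_{i}.jpg" for i in range(20)],
    "labels": [str(i % 2) for i in range(20)],
})
train_df, valid_df, test_df = split_frame(demo_df)
print(len(train_df), len(valid_df), len(test_df))

# With real image files on disk, the generators would be built like this:
# train_generator = make_generator(train_df, shuffle=True)
# valid_generator = make_generator(valid_df, shuffle=False)
# test_generator = make_generator(test_df, shuffle=False)
```

Since every subfolder holds images of one person, a purely random split can put the same person in both the training and the test frame. Whether that matters depends on what the model is meant to learn.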