Load image data for image classification from different subfolders


I have training image data and a CSV file that contains the labels for the images. My data directory looks like this:

Train data/
...1/
......1_1.jpg
......1_2.jpg
......1_3.jpg
...2/
......2_1.jpg
......2_2.jpg
......2_3.jpg
etc.

So each subfolder holds 3 different images of the same person, and all images in a subfolder share the same label. My CSV file has this format:

subfolder,labels
1,0
2,1
3,0
etc.

I know that tf.keras.preprocessing.image.ImageDataGenerator can read from a data frame, but the format it needs doesn't match my directory layout. Any clue on how to load my images efficiently to train my model? Thanks in advance.


Solution

  • I think this may do what you want. I created a directory called new_people containing 7 sub directories named 1 through 7, and placed 3 image files in each. The code below first creates a data frame df in the form you described for your CSV file, then builds a data frame data_df with columns filepaths and labels. The filepaths column holds the full path to each image file and the labels column holds the label associated with that image. I tested the code and it seems to work. The first part is shown below:

    import os
    import pandas as pd
    folder = [1, 2, 3, 4, 5, 6, 7]  # list of the folders
    labels = [2, 3, 1, 0, 6, 4, 5]  # list of the labels associated with each folder
    Fseries = pd.Series(folder, name='folder')
    Lseries = pd.Series(labels, name='labels')
    df = pd.concat([Fseries, Lseries], axis=1)  # this data frame should look like your csv file
    print(df.head(7))
    

    The printout would be:

       folder  labels
    0       1       2
    1       2       3
    2       3       1
    3       4       0
    4       5       6
    5       6       4
    6       7       5
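
In practice df would come from the CSV file rather than being typed in by hand. Here is a minimal sketch of that step, assuming the file has the subfolder,labels header shown in the question. The text is read from an in-memory string so the example runs on its own, and the rename to folder is only there to match the column name used in this answer.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; with a file on disk you would pass its path
# to pd.read_csv instead of this in-memory buffer.
csv_text = "subfolder,labels\n1,0\n2,1\n3,0\n"

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: rename the question's "subfolder" column to "folder" so the
# rest of the answer's code can use df['folder'] unchanged.
df = df.rename(columns={"subfolder": "folder"})
print(df)
```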
    

    The rest of the code is below:

    sdir = r'c:\temp\new_people'  # main directory where the class sub directories are present
    filepaths = []
    labels = []
    class_list = sorted(os.listdir(sdir))  # list of class sub directories, sorted for a stable order
    for klass in class_list:  # iterate over the class sub directories
        class_path = os.path.join(sdir, klass)  # path to the class sub directory
        if not os.path.isdir(class_path):  # skip stray files in the main directory
            continue
        for i in range(len(df)):  # iterate through the rows of df
            if str(df['folder'].iloc[i]) == klass:  # compare the folder name, as a string, to the current klass
                label = df['labels'].iloc[i]  # get the associated label
                flist = sorted(os.listdir(class_path))  # list all the files in the klass sub directory
                for f in flist:  # iterate through the list of files
                    fpath = os.path.join(class_path, f)  # full path to the file
                    filepaths.append(fpath)  # append the full file path
                    labels.append(str(label))  # append the label as a string
    Fseries = pd.Series(filepaths, name='filepaths')
    Lseries = pd.Series(labels, name='labels')
    data_df = pd.concat([Fseries, Lseries], axis=1)  # data frame with columns filepaths, labels
    print(data_df.head(28))
    # Now data_df can be partitioned into a train_df, a test_df and a valid_df using train_test_split
    

    The printout of the resulting data_df data frame is:

                            filepaths  labels
    0   c:\temp\new_people\1\0001.jpg       2
    1   c:\temp\new_people\1\0002.jpg       2
    2   c:\temp\new_people\1\0003.jpg       2
    3   c:\temp\new_people\2\0004.jpg       3
    4   c:\temp\new_people\2\0005.jpg       3
    5   c:\temp\new_people\2\0006.jpg       3
    6   c:\temp\new_people\3\0007.jpg       1
    7   c:\temp\new_people\3\0008.jpg       1
    8   c:\temp\new_people\3\0009.jpg       1
    9   c:\temp\new_people\4\0010.jpg       0
    10  c:\temp\new_people\4\0011.jpg       0
    11  c:\temp\new_people\4\0012.jpg       0
    12  c:\temp\new_people\5\0013.jpg       6
    13  c:\temp\new_people\5\0014.jpg       6
    14  c:\temp\new_people\5\0015.jpg       6
    15  c:\temp\new_people\6\0016.jpg       4
    16  c:\temp\new_people\6\0017.jpg       4
    17  c:\temp\new_people\6\0018.jpg       4
    18  c:\temp\new_people\7\0019.jpg       5
    19  c:\temp\new_people\7\0020.jpg       5
    20  c:\temp\new_people\7\0021.jpg       5
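
The nested loop above rescans df once per folder. As an alternative sketch (not the method used in this answer), the same filepaths/labels frame can be built by listing the files once and joining them to the label table with a pandas merge. The function name build_data_df and the temporary demo directory are made up for illustration, and the empty .jpg files only stand in for real images.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_data_df(root, labels_df, folder_col="folder", label_col="labels"):
    """Return one row per image file, with columns filepaths and labels."""
    root = Path(root)
    # One row per file that sits directly inside a sub directory of root.
    files = pd.DataFrame(
        [
            {folder_col: p.parent.name, "filepaths": str(p)}
            for p in sorted(root.glob("*/*"))
            if p.is_file()
        ],
        columns=[folder_col, "filepaths"],
    )
    # Folder names on disk are strings, so compare both columns as strings.
    lookup = labels_df[[folder_col, label_col]].astype(str)
    merged = files.merge(lookup, on=folder_col, how="inner")
    return merged[["filepaths", label_col]].rename(columns={label_col: "labels"})


# Demo on a throwaway directory tree: two folders with three empty files each.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ("1", "2"):
        (Path(tmp) / folder).mkdir()
        for i in range(3):
            (Path(tmp) / folder / f"{folder}_{i + 1}.jpg").touch()
    labels_df = pd.DataFrame({"folder": [1, 2], "labels": [0, 1]})
    demo_df = build_data_df(tmp, labels_df)
    print(demo_df)
```

Folders that have no row in the label table are dropped by the inner join, which matches what the loop does when no folder value equals the sub directory name.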
    

    The data frame appears to correctly reflect the folder labels in the df data frame. The data_df data frame can now be split with train_test_split into a train_df, a test_df and a valid_df. These can then be passed to ImageDataGenerator.flow_from_dataframe (with x_col='filepaths' and y_col='labels') to create a train_generator, a test_generator and a valid_generator for use with model.fit and model.evaluate or model.predict. If you need help on how to do that, let me know.
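
A rough sketch of that last step, under some assumptions: the 70/15/15 split, the 224x224 image size, the batch size, the rescale factor and the function names split_frame and make_generators are all illustrative choices, not values from the question. The small demo frame stands in for data_df. ImageDataGenerator is deprecated in recent TensorFlow releases, so treat the generator part as one option rather than the only way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_frame(data_df, train_frac=0.7, seed=123):
    """Split data_df into train, valid and test frames; valid and test share the rest."""
    train_df, rest_df = train_test_split(
        data_df, train_size=train_frac, shuffle=True, random_state=seed
    )
    valid_df, test_df = train_test_split(
        rest_df, train_size=0.5, shuffle=True, random_state=seed
    )
    return train_df, valid_df, test_df


def make_generators(train_df, valid_df, test_df, img_size=(224, 224), batch_size=32):
    """Build Keras generators from the three frames (TensorFlow is imported only here)."""
    import tensorflow as tf

    # Fix the class order once so all three generators map labels to the same indices.
    classes = sorted(pd.concat([train_df, valid_df, test_df])["labels"].unique())
    gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)  # assumption: scale to [0, 1]
    common = dict(x_col="filepaths", y_col="labels", classes=classes,
                  target_size=img_size, class_mode="categorical", batch_size=batch_size)
    train_gen = gen.flow_from_dataframe(train_df, shuffle=True, **common)
    valid_gen = gen.flow_from_dataframe(valid_df, shuffle=False, **common)
    test_gen = gen.flow_from_dataframe(test_df, shuffle=False, **common)
    return train_gen, valid_gen, test_gen


# Small stand-in for data_df: 7 folders with 3 (hypothetical) files each.
demo_df = pd.DataFrame({
    "filepaths": [f"{folder}/{i}.jpg" for folder in range(1, 8) for i in range(3)],
    "labels": [str(label) for label in [2, 3, 1, 0, 6, 4, 5] for _ in range(3)],
})
train_df, valid_df, test_df = split_frame(demo_df)
print(len(train_df), len(valid_df), len(test_df))
```

With real image paths, make_generators(train_df, valid_df, test_df) would return the three generators to pass to model.fit and model.evaluate. Because each folder holds images of one person, a row-level split like this can place the same person in both the training and the test set; splitting on folders instead of rows is one way to avoid that.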