I have a set of training images and a CSV file that contains their labels. My data directory looks like this:
Train data/
...1/
......1_1.jpg
......1_2.jpg
......1_3.jpg
...2/
......2_1.jpg
......2_2.jpg
......2_3.jpg
etc.
So each subfolder contains 3 different images of the same person, and all three share the same label. My CSV file has this format:
subfolder,labels
1,0
2,1
3,0
etc.
I know that tf.keras.preprocessing.image.ImageDataGenerator can read from a dataframe, but the format it needs doesn't match my directory layout. Any clue on how to load my images efficiently to train my model? Thanks in advance.
I think this may do what you want. I created a directory called new_people. Within it I created 7 subdirectories named 1, 2, 3, 4, 5, 6, 7, and placed 3 image files in each of them. In the code below I first create a data frame df in the form you described for your CSV file. Then I create a data frame data_df with the columns filepaths and labels. The filepaths column holds the full path to each image file and the labels column holds the label associated with that image. I tested the code and it seems to work. The code is shown below:
import os
import pandas as pd
folder=[1,2,3,4,5,6,7] # this is a list of the folders
labels=[2,3,1,0,6,4,5] # this is a list of the labels associated with each folder
Fseries=pd.Series(folder, name='folder')
Lseries=pd.Series(labels, name='labels')
df=pd.concat([Fseries, Lseries], axis=1) # this is the data frame that should be like your csv file
print (df.head(7))
The printout would be:
   folder  labels
0       1       2
1       2       3
2       3       1
3       4       0
4       5       6
5       6       4
6       7       5
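If the folder-to-label mapping already lives in a CSV file, as in the question, df could probably be loaded from it rather than typed in by hand. A minimal sketch, assuming the header is subfolder,labels as shown in the question; the helper name load_label_frame and the file name labels.csv are hypothetical:

```python
import io
import pandas as pd

def load_label_frame(csv_source):
    """Read the label CSV and return a frame with columns folder, labels."""
    # csv_source may be a file path or a file-like object
    frame = pd.read_csv(csv_source)
    # Rename 'subfolder' so later code can keep using df['folder']
    return frame.rename(columns={'subfolder': 'folder'})

# In-memory stand-in for the real file; with a file on disk this would be
# load_label_frame('labels.csv') (hypothetical file name)
sample = io.StringIO("subfolder,labels\n1,0\n2,1\n3,0\n")
df_from_csv = load_label_frame(sample)
print(df_from_csv)
```

The resulting frame has the same two columns as the hand-built df, so it should be usable in its place.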
The rest of the code is below:
sdir=r'c:\temp\new_people' # main directory where class sub directories are present
filepaths=[]
labels=[]
class_list=os.listdir(sdir) # list of class sub directories
for klass in class_list: # iterate over the class subdirectories
    class_path=os.path.join(sdir,klass) # path to class sub directory
    for i in range(len(df)): # iterate through the data set
        if str(df['folder'].iloc[i])==klass: # convert folder name to a string and compare to current klass
            label=df['labels'].iloc[i] # get the associated label
            flist=os.listdir(class_path) # get a list of all the files in the klass sub directory
            for f in flist: # iterate through the list of files
                fpath=os.path.join(class_path,f) # get the full path to the file
                filepaths.append(fpath) # append the full file path
                labels.append(str(label)) # append the label as a string
Fseries=pd.Series(filepaths, name='filepaths')
Lseries=pd.Series(labels, name='labels')
data_df=pd.concat([Fseries, Lseries], axis=1) # create data frame with columns filepaths, labels
print(data_df.head(28))
# Now data_df can be partitioned into a train_df, a test_df and a valid_df using train_test_split
The printout of the resulting data_df data frame is:
filepaths labels
0 c:\temp\new_people\1\0001.jpg 2
1 c:\temp\new_people\1\0002.jpg 2
2 c:\temp\new_people\1\0003.jpg 2
3 c:\temp\new_people\2\0004.jpg 3
4 c:\temp\new_people\2\0005.jpg 3
5 c:\temp\new_people\2\0006.jpg 3
6 c:\temp\new_people\3\0007.jpg 1
7 c:\temp\new_people\3\0008.jpg 1
8 c:\temp\new_people\3\0009.jpg 1
9 c:\temp\new_people\4\0010.jpg 0
10 c:\temp\new_people\4\0011.jpg 0
11 c:\temp\new_people\4\0012.jpg 0
12 c:\temp\new_people\5\0013.jpg 6
13 c:\temp\new_people\5\0014.jpg 6
14 c:\temp\new_people\5\0015.jpg 6
15 c:\temp\new_people\6\0016.jpg 4
16 c:\temp\new_people\6\0017.jpg 4
17 c:\temp\new_people\6\0018.jpg 4
18 c:\temp\new_people\7\0019.jpg 5
19 c:\temp\new_people\7\0020.jpg 5
20 c:\temp\new_people\7\0021.jpg 5
The data_df data frame appears to correctly reflect the folder labels in the df data frame. It can now be used with train_test_split to create a train_df, a test_df and a valid_df. These can then be used with ImageDataGenerator.flow_from_dataframe to create a train_generator, a test_generator and a valid_generator for use with model.fit and model.evaluate or model.predict. If you need help on how to do that, let me know.
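The nested loop above rescans df once for every subfolder, which may get slow with many folders. One possible alternative is to read the CSV into a dictionary once and look each folder up directly. This is only a sketch under assumptions: the function name build_file_list is hypothetical, the CSV header is taken to be subfolder,labels as in the question, and it uses the standard csv module instead of pandas.

```python
import csv
import os

def build_file_list(sdir, csv_path):
    """Return (filepath, label) pairs for every file in sdir/<subfolder>/."""
    # Map subfolder name -> label once, so each folder is a dictionary lookup
    with open(csv_path, newline='') as fh:
        label_of = {row['subfolder']: row['labels'] for row in csv.DictReader(fh)}

    pairs = []
    for klass in sorted(os.listdir(sdir)):  # sorted only for a reproducible order
        class_path = os.path.join(sdir, klass)
        # Skip stray files and any folder that has no row in the CSV
        if not os.path.isdir(class_path) or klass not in label_of:
            continue
        for f in sorted(os.listdir(class_path)):
            # Labels stay strings, matching str(label) in the code above
            pairs.append((os.path.join(class_path, f), label_of[klass]))
    return pairs
```

The returned pairs could then be turned into the same two-column frame with pd.DataFrame(pairs, columns=['filepaths', 'labels']).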