python tensorflow jupyter-notebook google-colaboratory image-manipulation

Load Image Dataset

I am trying to load data from a single directory that contains more than 10M images across 10K classes. The problem is that the images are not split into one subdirectory per class; they are all in the same directory. I have a CSV file that maps each image id to its label, and I want to train a VGG16 model on this data.

CSV:
id,lable
abf20a,CAR
dsf8sd,BIKE

Here, abf20a is the id of the image file "abf20a.jpg".

Please help me understand how I can load the images and labels together and train a model using VGG16.

thanks

Vishal

Solution

You can use ImageDataGenerator's flow_from_dataframe method to load the images using a CSV file.
Code:

import tensorflow as tf
import pandas as pd

df = pd.read_csv('data/img/new.csv')

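# Image generator; pass arguments here to enable data augmentation (none is applied by default)
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator()

# Read the files listed in the DataFrame; filenames are resolved relative to `directory`.
# With the default class_mode='categorical', the values in y_col must be strings.
train_ds = train_datagen.flow_from_dataframe(
    df,
    directory='data/img/new',
    x_col='filename',
    y_col='label')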

The DataFrame looks like this:

    filename    label
0   Capture.PNG 0

If your CSV only contains the id without the file extension, you can use the pandas apply method to append the .jpg extension (and then pass x_col = 'id' to flow_from_dataframe):

df['id'] = df['id'].apply(lambda x: '{}.jpg'.format(x))
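
Applied to the CSV from the question, this step might look like the sketch below. It assumes the header really is spelled lable, as shown in the question, and renames it. The inline CSV text and the separate filename column are illustrative choices, not part of the original data.

```python
import io

import pandas as pd

# Inline stand-in for the CSV shown in the question (assumption: the real
# file has the same two columns, with the header spelled "lable").
csv_text = "id,lable\nabf20a,CAR\ndsf8sd,BIKE\n"

# Read everything as strings so ids and labels are never parsed as numbers.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Rename the misspelled header so it matches y_col='label'.
df = df.rename(columns={"lable": "label"})

# Build the filename column expected by x_col='filename'.
df["filename"] = df["id"] + ".jpg"

print(df[["filename", "label"]])
```

Keeping the original id column and adding a separate filename column leaves the ids available for later lookups.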

For the complete set of data augmentation options provided by ImageDataGenerator, see its page in the TensorFlow API documentation.

The complete set of options for flow_from_dataframe is listed on the same page.

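With this approach, you don't have to worry about mismatched labels: each image and its label come from the same DataFrame row, and the pairing is handled by the built-in method. The files are also loaded batch by batch as they are needed, so the whole dataset never has to fit in main memory.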
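
The question asks about VGG16, but no model is defined in this answer. A minimal sketch of one possible setup follows. The helper name build_vgg16_classifier, the 224x224 input size, the frozen base, the pooling and Dense head, and the value of NUM_CLASSES are all assumptions made for illustration. weights=None is used so the sketch builds without downloading anything; weights='imagenet' is the usual choice for transfer learning.

```python
import tensorflow as tf

NUM_CLASSES = 10        # hypothetical; the question mentions about 10K classes
IMG_SIZE = (224, 224)   # assumed input size; must match target_size in the generator


def build_vgg16_classifier(num_classes, weights=None):
    """Build a VGG16 convolutional base with a small softmax head on top."""
    base = tf.keras.applications.VGG16(
        include_top=False,            # drop the original 1000-class ImageNet head
        weights=weights,              # None = random init, 'imagenet' = pretrained
        input_shape=IMG_SIZE + (3,),
    )
    base.trainable = False            # freeze the base; only the new head is trained

    inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    # categorical_crossentropy matches the one-hot labels produced by
    # flow_from_dataframe with its default class_mode='categorical'.
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_vgg16_classifier(NUM_CLASSES)
```

With this input size, flow_from_dataframe would need target_size=(224, 224), since its default is (256, 256). If pretrained ImageNet weights are used, passing preprocessing_function=tf.keras.applications.vgg16.preprocess_input to ImageDataGenerator applies the matching input preprocessing.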

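For training, you can simply use the following, where model is your compiled Keras model and validation_ds is a second generator built the same way from your validation data: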

model.fit(
        train_ds,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=validation_ds,
        validation_steps=800)
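
The step counts above are fixed numbers. One common convention is to set them so that each epoch covers the data once, which is a ceiling division of the sample count by the batch size. The sketch below assumes the default batch size of 32 used by flow_from_dataframe; the helper name steps_for and the sample count are illustrative.

```python
import math


def steps_for(n_samples, batch_size=32):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


# Illustrative count based on the "more than 10M images" in the question.
n_train = 10_000_000
print(steps_for(n_train))
```

As far as I know, the iterator returned by flow_from_dataframe also reports its own length, so steps_per_epoch and validation_steps can usually be omitted and Keras will run one full pass per epoch.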