python · tensorflow · keras · deep-learning · dataset

Too much RAM is required to load the dataset


I’m working on a neural network and my dataset has 42000 images, all of which I have to load. I’m using Google Colab, but every time I load the dataset I run out of RAM.

I am putting everything into a NumPy array, because I tried to use the ImageDataGenerator approach and it didn’t work. I’m using the following code to load the data:

import glob

import numpy as np
import tensorflow as tf

# "class" is a reserved word in Python, so the list of paths needs another name
class_paths = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")

data = []
labels = []

for path in class_paths:
    image = tf.keras.preprocessing.image.load_img(path, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)

data = np.array(data)
labels = np.array(labels)
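
Even a quick back-of-envelope estimate (assuming all 42000 images are kept at 336×336×3) shows why this array cannot fit in a standard Colab runtime, which offers roughly 12 GB of RAM:

# Back-of-envelope memory estimate for holding every image in one array
n_images = 42000
height, width, channels = 336, 336, 3

bytes_per_image = height * width * channels   # uint8, as returned by np.array(PIL image)
total_uint8 = n_images * bytes_per_image      # ~14.2 GB
total_float32 = total_uint8 * 4               # ~56.9 GB once cast to float32

print(total_uint8 / 1e9, total_float32 / 1e9)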

Solution

  • As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images when needed.

    The strategy here is to create a Pandas DataFrame with the path and class of every image, then turn the class into a numeric label with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract three subsets: train, test and validation. The last step is to wrap these collections in datasets compatible with TensorFlow.

    Each time TensorFlow processes a batch, the Sequence loads that batch of images into memory, and so on.

    Step 0: Imports and constants

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    import pathlib
    from sklearn.model_selection import train_test_split
    
    INPUT_SHAPE = (336, 336, 3)
    BATCH_SIZE = 32
    
    DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
    

    Step 1: Load all image paths to a Pandas DataFrame:

    # Find images of dataset
    data = []
    for file in DATA_DIR.glob('**/*.jpg'):
        d = {'class': file.parent.name,
             'path': file}
        data.append(d)
    
    # Create dataframe and select columns
    df = pd.DataFrame(data)
    df['label'] = pd.factorize(df['class'])[0]
    X = df['path']
    y = df['label']
    
    # Split into 3 subsets: train, test and validation
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.2, random_state=2023)
    X_train, X_valid, y_train, y_valid = \
        train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
    

    Step 2: Create a custom data Sequence

    class ImgDataSequence(tf.keras.utils.Sequence):
        """
        Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
        """
    
        def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
            self.image_set = np.array(image_set)
            self.label_set = np.array(label_set)
            self.batch_size = batch_size
            self.image_size = image_size
    
        def __get_image(self, image):
            image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
            image_arr = tf.keras.preprocessing.image.img_to_array(image)
            return image_arr
    
        def __get_data(self, images, labels):
            image_batch = np.asarray([self.__get_image(img) for img in images])
            label_batch = np.asarray(labels)
            return image_batch, label_batch
    
        def __getitem__(self, index):
            images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
            labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
            images, labels = self.__get_data(images, labels)
            return images, labels
    
        def __len__(self):
            return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
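
    If you also want to reshuffle the training data between epochs, Keras calls an optional on_epoch_end hook on every Sequence. A minimal sketch of such a method for the class above (an optional addition, not part of the original code):

        # Optional method to add to ImgDataSequence: reshuffle the (path, label)
        # pairs after each epoch. Keras calls on_epoch_end automatically.
        def on_epoch_end(self):
            permutation = np.random.permutation(len(self.image_set))
            self.image_set = self.image_set[permutation]
            self.label_set = self.label_set[permutation]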
    

    Step 3: Create datasets

    train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    

    Test the new datasets:

    # Take the first batch of our train dataset
    >>> imgs, labels = train_ds[0]
    
    # Check the length (should equal BATCH_SIZE)
    >>> len(labels)
    32
    
    # Check the dimension of one image
    >>> imgs[0].shape
    (336, 336, 3)
    

    How to use it with TensorFlow?
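
    The snippets below assume that model is an already-compiled Keras model whose input matches INPUT_SHAPE; a minimal sketch, where the architecture is only an illustrative placeholder (not part of the original answer):

    # Placeholder model: any compiled Keras model with a matching input shape works
    num_classes = df['label'].nunique()

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255, input_shape=INPUT_SHAPE),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])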

    # train_ds & valid_ds to fit
    history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
    
    # test_ds to evaluate
    loss, *metrics = model.evaluate(test_ds)