python · tensorflow · keras · deep-learning · dataset

Too much RAM is required to load the dataset


I’m working on a neural network and my dataset has 42000 images, all of which I have to load. I’m using Google Colab, but every time I load the dataset I run out of RAM.

I am putting everything into a NumPy array, because I tried to use the ImageDataGenerator approach and it didn’t work. I’m using the following code to load the data:

import glob

import numpy as np
import tensorflow as tf

# "class" is a reserved word in Python, so the list of paths needs another name
class_paths = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")

data = []
labels = []

for path in class_paths:
    image = tf.keras.preprocessing.image.load_img(path, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)

data = np.array(data)
labels = np.array(labels)
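
Even a quick back-of-envelope estimate (assuming all 42000 images are kept at 336×336×3) shows why this array cannot fit in a standard Colab runtime, which offers roughly 12 GB of RAM:

# Back-of-envelope memory estimate for holding every image in one array
n_images = 42000
height, width, channels = 336, 336, 3

bytes_per_image = height * width * channels   # uint8, as returned by np.array(PIL image)
total_uint8 = n_images * bytes_per_image      # ~14.2 GB
total_float32 = total_uint8 * 4               # ~56.9 GB once cast to float32

print(total_uint8 / 1e9, total_float32 / 1e9)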

Solution

  • As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images when needed.

    The strategy here is to create a Pandas DataFrame with the path and class of every image, then turn the class into a numeric label with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract three subsets: train, test and validation. The last step is to wrap these collections in datasets compatible with TensorFlow.

    Each time TensorFlow processes a batch, the Sequence loads that batch of images into memory, and so on.

    Step 0: Imports and constants

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    import pathlib
    from sklearn.model_selection import train_test_split
    
    INPUT_SHAPE = (336, 336, 3)
    BATCH_SIZE = 32
    
    DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
    

    Step 1: Load all image paths to a Pandas DataFrame:

    # Find images of dataset
    data = []
    for file in DATA_DIR.glob('**/*.jpg'):
        d = {'class': file.parent.name,
             'path': file}
        data.append(d)
    
    # Create dataframe and select columns
    df = pd.DataFrame(data)
    df['label'] = pd.factorize(df['class'])[0]
    X = df['path']
    y = df['label']
    
    # Split into 3 subsets: train, test and validation
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.2, random_state=2023)
    X_train, X_valid, y_train, y_valid = \
        train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
    

    Step 2: Create a custom data Sequence

    class ImgDataSequence(tf.keras.utils.Sequence):
        """
        Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
        """
    
        def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
            self.image_set = np.array(image_set)
            self.label_set = np.array(label_set)
            self.batch_size = batch_size
            self.image_size = image_size
    
        def __get_image(self, image):
            image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
            image_arr = tf.keras.preprocessing.image.img_to_array(image)
            return image_arr
    
        def __get_data(self, images, labels):
            image_batch = np.asarray([self.__get_image(img) for img in images])
            label_batch = np.asarray(labels)
            return image_batch, label_batch
    
        def __getitem__(self, index):
            images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
            labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
            images, labels = self.__get_data(images, labels)
            return images, labels
    
        def __len__(self):
            return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
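
    If you also want to reshuffle the training data between epochs, Keras calls an optional on_epoch_end hook on every Sequence. A minimal sketch of such a method for the class above (an optional addition, not part of the original code):

        # Optional method to add to ImgDataSequence: reshuffle the (path, label)
        # pairs after each epoch. Keras calls on_epoch_end automatically.
        def on_epoch_end(self):
            permutation = np.random.permutation(len(self.image_set))
            self.image_set = self.image_set[permutation]
            self.label_set = self.label_set[permutation]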
    

    Step 3: Create datasets

    train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
    

    Test the new datasets:

    # Take the first batch of our train dataset
    >>> imgs, labels = train_ds[0]
    
    # Check the length (should equal BATCH_SIZE)
    >>> len(labels)
    32
    
    # Check the dimension of one image
    >>> imgs[0].shape
    (336, 336, 3)
    

    How to use it with TensorFlow?
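
    The snippets below assume that model is an already-compiled Keras model whose input matches INPUT_SHAPE; a minimal sketch, where the architecture is only an illustrative placeholder (not part of the original answer):

    # Placeholder model: any compiled Keras model with a matching input shape works
    num_classes = df['label'].nunique()

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255, input_shape=INPUT_SHAPE),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])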

    # train_ds & valid_ds to fit
    history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
    
    # test_ds to evaluate
    loss, *metrics = model.evaluate(test_ds)