I'm trying to load a dataset, stored in two .npy files (for features and ground truth) on my drive, and use it to train a neural network.
import numpy as np
import tensorflow as tf

print("loading features...")
data = np.load("[...]/features.npy")
print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255  # scale labels to [0, 1]
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
This throws the following error when calling from_tensor_slices():
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
The ground truth file is larger than 2.44 GB, so I run into problems when creating a Dataset from it (see the warnings here and here).
Possible solutions I found were either for TensorFlow 1.x (here and here, while I am running version 2.6) or suggested using NumPy's memmap (here), which I unfortunately couldn't get to run. Besides, I wonder whether memory-mapping slows down the computation?
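For reference, the memmap route I tried looks roughly like this (with a small placeholder file standing in for my real groundtruth.npy; mmap_mode='r' makes np.load return a lazily-read memory map instead of pulling the whole array into RAM):

```python
import os
import tempfile

import numpy as np

# Small placeholder array standing in for the real multi-GB groundtruth.npy.
path = os.path.join(tempfile.gettempdir(), "groundtruth_demo.npy")
np.save(path, np.zeros((100, 64), dtype=np.uint8))

# mmap_mode='r' maps the file read-only; pages are fetched from disk lazily,
# so the whole file never has to fit in memory at once.
labels_mm = np.load(path, mmap_mode='r')
print(type(labels_mm))   # <class 'numpy.memmap'>
print(labels_mm.shape)   # (100, 64)
```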
I'd appreciate your help, thanks!
You need some kind of data generator, because your data is far too big to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here's an example of how you could fetch data batches and train your model inside a custom training loop. The data is an NPZ NumPy archive from here:
import random

import numpy as np

def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
    dataset_zip = np.load(file, encoding='latin1')
    images = dataset_zip['imgs']
    latents_classes = dataset_zip['latents_classes']
    return images, latents_classes

def get_batch(indices, train_images, train_categories):
    # Column 1 of latents_classes holds the shape category.
    shapes_as_categories = np.array([train_categories[i][1] for i in indices])
    images = np.array([train_images[i] for i in indices])
    return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'),
            shapes_as_categories.reshape((shapes_as_categories.shape[0], 1)).astype('float32')]
# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)

epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size

for epoch in range(epochs):
    for i in range(total_batch):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        batch = get_batch(batch_indices, train_images, train_categories)
        ...
        # Train your model with this batch.
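The index-based batching above can also be written as a plain Python generator, which never materializes the full arrays as one tensor; such a generator can then be iterated directly in the custom loop, or handed to TensorFlow via tf.data.Dataset.from_generator. A minimal sketch, with small random placeholder arrays standing in for your features.npy/groundtruth.npy:

```python
import numpy as np

def batch_generator(images, labels, batch_size=256, shuffle=True):
    """Yield (images, labels) batches without building one giant tensor."""
    indices = np.arange(images.shape[0])
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield (images[batch_idx].astype('float32'),
               labels[batch_idx].astype('float32'))

# Placeholder data standing in for the real .npy files.
images = np.zeros((1000, 64, 64, 1), dtype=np.uint8)
labels = np.zeros((1000, 1), dtype=np.uint8)

batches = list(batch_generator(images, labels, batch_size=256))
print(len(batches))         # 4  (3 full batches of 256 plus one of 232)
print(batches[0][0].shape)  # (256, 64, 64, 1)
```

Pairing a generator like this with np.load(..., mmap_mode='r') means even the 2.44 GB label file is only read from disk page by page as batches touch it, at the cost of somewhat slower per-batch access than an in-memory array.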