Using Datasets from large NumPy arrays in TensorFlow

I'm trying to load a dataset, stored in two .npy files (for features and ground truth) on my drive, and use it to train a neural network.

print("loading features...")
data = np.load("[...]/features.npy")

print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255

dataset = tf.data.Dataset.from_tensor_slices((data, labels))

throws a tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized. error when calling the from_tensor_slices() method.

The ground-truth file is larger than 2.44 GB, so I run into problems when creating a Dataset from it (see warnings here and here).

The possible solutions I found were either written for TensorFlow 1.x (here and here; I am running version 2.6) or rely on numpy's memmap (here), which I unfortunately could not get to run. I also wonder whether memmap would slow down the computation.

I'd appreciate your help, thanks!


  • You need some kind of data generator, because your data is way too big to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here's an example of how you could get data batches and train your model inside a custom training loop. The data is an NPZ NumPy archive from here:

    import numpy as np


    def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
        # Read the image array and the per-image latent classes from the archive.
        dataset_zip = np.load(file, encoding='latin1')
        images = dataset_zip['imgs']
        latents_classes = dataset_zip['latents_classes']
        return images, latents_classes


    def get_batch(indices, train_images, train_categories):
        # Column 1 of the latent classes is used as the shape category.
        shapes_as_categories = np.array([train_categories[i][1] for i in indices])
        images = np.array([train_images[i] for i in indices])
        # Add a channel axis to the 64x64 images and cast both arrays to float32.
        return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'),
                shapes_as_categories.reshape(shapes_as_categories.shape[0], 1).astype('float32')]


    # Load your data once
    train_images, train_categories = load_data()
    indices = list(range(train_images.shape[0]))
    epochs = 2000
    batch_size = 256
    # Number of full batches per epoch (a trailing partial batch is skipped).
    total_batch = train_images.shape[0] // batch_size

    for epoch in range(epochs):
        for i in range(total_batch):
            batch_indices = indices[batch_size * i: batch_size * (i + 1)]
            batch = get_batch(batch_indices, train_images, train_categories)
            # Train your model with this batch.
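Since the question also asks about numpy's memmap, here is a minimal sketch of the same batching idea applied to two .npy files opened with np.load(..., mmap_mode="r"), so that only the rows of the current batch are read into memory. The function name iter_batches, the temporary stand-in files and their tiny shapes are assumptions made purely for illustration; the division by 255 mirrors the label scaling in the question.

```python
import os
import tempfile

import numpy as np


def iter_batches(features_path, labels_path, batch_size):
    """Yield (features, labels) float32 batches without loading the whole files."""
    # mmap_mode="r" maps the .npy files read-only; data is only read from
    # disk for the slices that are actually indexed.
    features = np.load(features_path, mmap_mode="r")
    labels = np.load(labels_path, mmap_mode="r")
    for start in range(0, features.shape[0], batch_size):
        stop = start + batch_size
        # np.array(...) copies just this batch into regular memory.
        x = np.array(features[start:stop], dtype=np.float32)
        y = np.array(labels[start:stop], dtype=np.float32) / 255
        yield x, y


# Tiny stand-in files (hypothetical shapes) so the sketch runs end to end.
tmp_dir = tempfile.mkdtemp()
features_path = os.path.join(tmp_dir, "features.npy")
labels_path = os.path.join(tmp_dir, "groundtruth.npy")
np.save(features_path, np.arange(10 * 4, dtype=np.uint8).reshape(10, 4))
np.save(labels_path, np.full((10, 2), 255, dtype=np.uint8))

batches = list(iter_batches(features_path, labels_path, batch_size=4))
for x, y in batches:
    # Train your model with this batch.
    print(x.shape, y.shape)
```

A generator like this could probably also be handed to tf.data.Dataset.from_generator (with an output_signature describing the two arrays) if you prefer a Dataset over a custom loop. Each batch is read from disk when it is requested, so whether this slows training down depends on your storage and is worth measuring on your own setup.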