
How to create a tensorflow dataset from a list of filenames that need to be loaded and transformed and their corresponding labels

Given a list of npy filenames x:

x = ['path/to/file1.npy', 'path/to/file2.npy']

and a list of labels y:

y = [1, 0]

I want to create a tensorflow Dataset that consists of pairs of the labels and the loaded and transformed numpy arrays contained within the npy files.


  1. Each npy file must be loaded, the numpy array contained within must undergo an arbitrary transformation (irrelevant to the question) and then the array must be finally added to the Dataset along with its corresponding label.
  2. It is necessary to use a Dataset because the files are too large to be loaded into memory all at once.
  3. The npy files are not all contained in a single directory.

Existing answers and how they don't match for my case:


  • In response to the comment on the question: you need a tf.py_function wrapper to use non-TensorFlow functions, because they cannot be used directly in the .map method (most of the code comes from this question):

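    import numpy as np
    import skimage.transform
    import tensorflow as tf

    def load_files_py(filename, width, height):
        # py_function passes eager tensors, so convert them to Python values first.
        image = np.load(filename.numpy().decode("utf-8"))
        image = skimage.transform.resize(image, (int(height), int(width)))
        return image.astype(np.float32)

    def parse_function(image_filename, label):
        # width and height are assumed to be defined elsewhere.
        image = tf.py_function(load_files_py, inp=[image_filename, width, height], Tout=tf.float32)
        return image, label

    # train_filenames and labels are the lists of paths and labels (x and y in the question).
    dataset = tf.data.Dataset.from_tensor_slices((train_filenames, labels))
    dataset = dataset.map(parse_function, num_parallel_calls=PARALLEL_CALLS)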

    You most likely meant skimage.transform.resize; there is no resize function under sklearn. If you want to preserve the number of channels, leave the channel dimension out of the output shape argument.
    Note that num_parallel_calls may bring little benefit here, because tf.py_function acquires the Python GIL (global interpreter lock), which prevents the Python code from running on multiple threads at once.

    You could replace skimage.transform.resize with tf.image.resize, or use tf.keras.layers.Resizing directly in the model. They do practically the same thing.
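
As a rough illustration of the tf.image.resize variant, here is a minimal sketch in which only the file loading stays inside tf.py_function and the resize runs as a regular TensorFlow op. The names load_npy, make_dataset, HEIGHT and WIDTH, the temporary demo files, and the assumption that every array has shape (height, width, channels) are all hypothetical and not part of the original answer.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

HEIGHT, WIDTH = 8, 8  # hypothetical target size


def load_npy(path):
    # Inside tf.py_function the path arrives as an eager string tensor.
    array = np.load(path.numpy().decode("utf-8"))
    return array.astype(np.float32)


def parse_function(filename, label):
    # Only the loading step runs in Python.
    image = tf.py_function(load_npy, inp=[filename], Tout=tf.float32)
    # py_function drops static shape information; assume rank-3 (H, W, C) arrays.
    image.set_shape([None, None, None])
    # The resize itself is a TensorFlow op, so it does not need the GIL.
    image = tf.image.resize(image, (HEIGHT, WIDTH))
    return image, label


def make_dataset(filenames, labels):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    return dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE)


# Demo with two small temporary files of different sizes.
tmp_dir = tempfile.mkdtemp()
filenames = []
for i, shape in enumerate([(16, 12, 3), (20, 10, 3)]):
    path = os.path.join(tmp_dir, f"file{i}.npy")
    np.save(path, np.random.rand(*shape))
    filenames.append(path)

dataset = make_dataset(filenames, [1, 0])
for image, label in dataset:
    print(image.shape, int(label))
```

Files are read lazily, one element at a time, as the dataset is iterated, so the full set of arrays never has to fit in memory. The paths can come from any number of directories, since the dataset only stores the filename strings.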