Given a list of npy filenames x:
x = ['path/to/file1.npy', 'path/to/file2.npy']
and a list of labels y:
y = [1, 0]
I want to create a tensorflow Dataset that yields pairs of the labels and the numpy arrays loaded (and transformed) from the npy files, i.e. each array along with its corresponding label. The arrays must be loaded lazily, as the files are too large to fit into memory at once.
The related question "Tensorflow dataset from lots of .npy files" a) does not offer clear directions on how to construct the mapping function that does the loading and b) focuses on a function that only handles the arrays and not their corresponding labels.
The answer to "What is the best way to load data with tf.data.Dataset in memory efficient way" does not provide what I ask about (the mapping function to load both x and y, along with transformations of x) but instead uses a placeholder for that function (PARSE_FUNCTION).
Answering the comment on the question: you need a tf.py_function wrapper to use non-TensorFlow functions, since you can't call non-TF functions directly in the .map method (most of the code below comes from this question):
import numpy as np
import skimage.transform
import tensorflow as tf

def load_files_py(filename, width, height):
    # Eager python: load one array from disk and resize it.
    image = np.load(filename.numpy().decode())
    image = skimage.transform.resize(image, (height, width))
    return image.astype(np.float32)

def parse_function(filename, label, width=224, height=224):
    # width/height defaults are placeholders; set them to your target size.
    image = tf.py_function(load_files_py, inp=[filename, width, height], Tout=tf.float32)
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE)
You most likely meant skimage.transform.resize; skimage.io.resize does not exist. If you want to preserve the number of channels, don't pass the channel dimension as part of the output shape: resize only the height and width.
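A minimal sketch of that channel-preserving behavior, using a dummy random array in place of a real npy file:

```python
import numpy as np
import skimage.transform

# A dummy 3-channel "image" of shape (height, width, channels).
image = np.random.rand(100, 80, 3)

# Pass only (height, width); the trailing channel axis is preserved.
resized = skimage.transform.resize(image, (64, 64))
print(resized.shape)  # (64, 64, 3)
```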
Note that num_parallel_calls could be useless here, because tf.py_function acquires the python GIL (global interpreter lock), which prevents multithreading.
You could switch out skimage.transform.resize and use tf.image.resize in the mapping function, or a tf.keras.layers.Resizing layer directly in the model. They are practically the same.