Tags: python, tensorflow, generator, tensorflow-datasets

Tensorflow dataset from lots of .npy files


I'm trying to create a TensorFlow dataset from 6500 .npy files, each of shape [256, 256].

My previous method (for fewer files) was to load and stack them into an np.array, and then use tf.data.Dataset.from_tensor_slices(stacked_data).

With the current number of files I get ValueError: Cannot create a tensor proto whose content is larger than 2GB.
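For reference, a minimal sketch of that previous approach (assuming onlyfiles is the list of .npy paths, as below); from_tensor_slices embeds the stacked array in the graph as a single constant, which is what hits the 2 GB limit:

import numpy as np
import tensorflow as tf

# Stack everything into one big [6500, 256, 256] array in memory.
stacked_data = np.stack([np.load(f) for f in onlyfiles])

# Fails once the array's serialized form exceeds the 2 GB proto limit.
dataset = tf.data.Dataset.from_tensor_slices(stacked_data)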

I'm now trying the following:

def data_generator():
    processed = []
    for i in range(len(onlyfiles)):
        processed.append(tf.convert_to_tensor(np.load(onlyfiles[i], mmap_mode='r')))
    yield iter(tf.concat(processed, 0))

_dataset = tf.data.Dataset.from_generator(generator=data_generator, output_types=tf.float32)

where onlyfiles is the list of file names.

I get multiple errors, one of which is the following:

2022-10-01 11:25:44.602505: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: TypeError: `generator` yielded an element that could not be converted to the expected type. The expected type was float32, but the yielded element was <generator object Tensor.__iter__ at 0x7fe6d7d506d0>.
Traceback (most recent call last):

  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 653, in generator_py_func
    ret_arrays.append(script_ops.FuncRegistry._convert(  # pylint: disable=protected-access

  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/ops/script_ops.py", line 195, in _convert
    result = np.asarray(value, dtype=dtype, order="C")

TypeError: float() argument must be a string or a number, not 'generator'

What should I change? Is there another way to do this?

Since I created the dataset myself, is there a better way to prepare it for TensorFlow?


After a few days, I found this solution. I don't know how good it is, but I'll post it just in case someone finds it useful:

import itertools
from os import listdir
from os.path import isfile, join

import numpy as np
import tensorflow as tf

@tf.function
def input_fn():
    tf.compat.v1.enable_eager_execution()
    mypath = 'tensorflow_datasets/Dataset_1/'
    list_of_file_names = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]

    def gen():
        # Cycle through the files indefinitely, loading one array per yield.
        for i in itertools.count(1):
            data1 = np.load(list_of_file_names[i % len(list_of_file_names)])
            data2 = np.where(data1 > 1, data1, 1)  # clamp values below 1 to 1
            # Convert to decibels (20*log10) before yielding.
            yield tf.convert_to_tensor(np.where(data2 > 0, 20 * np.log10(data2), 0))

    dataset = tf.data.Dataset.from_generator(gen, tf.float32)

    return dataset.make_one_shot_iterator().get_next()
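For what it's worth, make_one_shot_iterator is a TF 1.x API. Under TF 2.x, where eager execution is already on and Dataset objects are directly iterable, a sketch of the same idea (reusing gen from above) would just return the dataset and pull elements from it directly:

dataset = tf.data.Dataset.from_generator(gen, tf.float32)
first_element = next(iter(dataset))  # one [256, 256] float32 tensor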

Solution

  • I usually do such things as follows:

    dataset = tf.data.Dataset.from_tensor_slices(list_of_file_names)
    
    # Optional
    dataset = dataset.repeat().shuffle(...)
    
    def read_file(file_name):
        full_path_to_image_file = ...  # build full path
        buffer = tf.io.read_file(full_path_to_image_file)
        tensor = ...  # convert from buffer to tensor
        return tensor
    
    dataset = dataset.map(read_file, num_parallel_calls=...)
    

    As an option, you can read the file with np.load inside py_function (use decode("utf-8") to convert the byte-string path to an ordinary Python string), like this:

    def read_file(file_path):
        tensor = tf.py_function(
            func=lambda path: np.load(path.numpy().decode("utf-8")),
            inp=[file_path],
            Tout=tf.float32
        )
        tensor.set_shape(img_shape)  # img_shape would be (256, 256) here
        return tensor
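
    Putting the pieces together, a minimal end-to-end sketch (assuming list_of_file_names holds the full .npy paths and each file is a [256, 256] array; the batch size is illustrative, and the astype call ensures the returned dtype matches Tout):

    img_shape = (256, 256)
    
    def read_file(file_path):
        tensor = tf.py_function(
            func=lambda path: np.load(path.numpy().decode("utf-8")).astype(np.float32),
            inp=[file_path],
            Tout=tf.float32
        )
        tensor.set_shape(img_shape)
        return tensor
    
    dataset = (
        tf.data.Dataset.from_tensor_slices(list_of_file_names)
        .shuffle(buffer_size=len(list_of_file_names))
        .map(read_file, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .batch(32)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )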