tf.data.Dataset.from_generator takes only first 256 elements


I am using the from_generator function in tf.data.Dataset to load my data of 9000 samples, but it takes only the first 256 elements and repeats them to fill 9000 samples.

import tensorflow as tf

z = list(range(9000))  # 9000 is the length of my dataset

def gen():
  for idx in z:
    yield idx

dataset = tf.data.Dataset.from_generator(gen, tf.uint8)

for step, sample in enumerate(dataset):
  print(step)
  print(sample)

Expected behavior:

0
tf.Tensor(0, shape=(), dtype=uint8)

...

8999
tf.Tensor(8999, shape=(), dtype=uint8)

Actual behavior:

0
tf.Tensor(0, shape=(), dtype=uint8)
1
tf.Tensor(1, shape=(), dtype=uint8)

...

255
tf.Tensor(255, shape=(), dtype=uint8)
256
tf.Tensor(0, shape=(), dtype=uint8)

...

I feel like I filled a sort of buffer of length 256, but I am not sure. Would appreciate any help!


Solution

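  • With tensors of dtype tf.uint8, you store unsigned 8-bit integers. The largest value an unsigned 8-bit integer can encode is 2⁸ - 1 = 255. A larger number wraps around modulo 256, so 256 becomes 0, 257 becomes 1, and so on. That is why the first 256 values appear to repeat: nothing is being buffered, the values are simply truncated to 8 bits.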

    To find an appropriate dtype, make sure that the largest and smallest values in your dataset are representable in that data type. In most cases tf.dtypes.int32 is a good choice, or tf.dtypes.uint32 if your integers must be unsigned.

    Note also that the output_types argument of tf.data.Dataset.from_generator, which you are using, has been deprecated. The TensorFlow API documentation recommends using output_signature instead:

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))
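
To make the arithmetic above concrete, here is a minimal pure-Python sketch. It does not call TensorFlow; it only models the wraparound reported in the question as reduction modulo 256 and checks whether a list of values fits a given integer type. The names INT_RANGES, wrap_uint8 and fits are made up for this illustration and are not part of any library.

```python
# Value ranges of a few fixed-width integer types (standard limits).
# INT_RANGES, wrap_uint8 and fits are illustrative helpers, not TensorFlow APIs.
INT_RANGES = {
    "uint8": (0, 2**8 - 1),
    "int32": (-2**31, 2**31 - 1),
    "uint32": (0, 2**32 - 1),
    "int64": (-2**63, 2**63 - 1),
}


def wrap_uint8(value):
    """Model the wraparound from the question: keep only the low 8 bits."""
    return value % 256


def fits(dtype_name, values):
    """Return True if every value is representable in the named integer type."""
    lo, hi = INT_RANGES[dtype_name]
    return lo <= min(values) and max(values) <= hi


z = list(range(9000))  # same data as in the question
wrapped = [wrap_uint8(v) for v in z]

print(wrapped[255], wrapped[256], wrapped[8999])  # 255 0 39
print(fits("uint8", z), fits("int32", z))         # False True
```

The same kind of min/max check against your real data tells you whether a narrower dtype is safe before you pass it to output_signature.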