TensorFlow Dataset: Order appears randomised when iterating via For loop?


I am creating some batched TensorFlow datasets using tf.keras.preprocessing.image_dataset_from_directory:

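import os
import tensorflow as tf

# model_split_dir is defined elsewhere: the root folder holding 'train' and 'test'
image_size = (90, 120)
batch_size = 32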

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'test'),
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)

If I then use the following for loop to get image and label information from one of the datasets, I get different outputs each time I run it:

for images, labels in test_ds:
  print(labels)

For instance, the first batch will appear like this in one run:

tf.Tensor([0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1], shape=(32,), dtype=int32)

But it is then completely different when the loop is run again:

tf.Tensor([1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0], shape=(32,), dtype=int32)

How can the order be different every time I loop over it? Are TensorFlow datasets unordered? From what I've found, they are supposed to be ordered, so I have no idea why the for loop returns the labels in different orders each time.

Any insight regarding this would be much appreciated.

UPDATE: The shuffling of the dataset order is working as intended. For my test data, I just need to set shuffle to False. Many thanks @AloneTogether!


Solution

  • The shuffle parameter of tf.keras.preprocessing.image_dataset_from_directory is set to True by default. If you want deterministic results, set it to False; the files are then read in sorted (alphanumeric) order:

    import tensorflow as tf
    import pathlib
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      image_size=(28, 28),
      batch_size=5,
      shuffle=False)
    
    for x, y in train_ds:
      print(y)
      break
    

    This, on the other hand, is shuffled without a seed, so the order differs on every run:

    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      seed=None,
      image_size=(28, 28),
      batch_size=5,
      shuffle=True)
    
    for x, y in train_ds:
      print(y)
      break
    

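    If you set a seed and keep shuffle=True, the data is still shuffled, but reproducibly. Successive passes over the same dataset object in one session can still differ (as in the question, where seed=1 was set), yet rerunning the script from scratch gives the same first batch: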

    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      seed=123,
      image_size=(28, 28),
      batch_size=5,
      shuffle=True)
    
    for x, y in train_ds:
      print(y)
      break
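
To see why a seeded dataset can still print a different order on each pass, here is a minimal sketch using plain tf.data rather than image files. It assumes that image_dataset_from_directory shuffles through the ordinary tf.data.Dataset.shuffle mechanism, whose reshuffle_each_iteration option defaults to True; that is my reading of the behaviour, not something stated in the question. The helper two_passes and the toy range dataset are made up for illustration.

```python
import tensorflow as tf


def two_passes(ds):
    """Iterate over the dataset twice and return both orders as lists."""
    first = [int(x) for x in ds.as_numpy_iterator()]
    second = [int(x) for x in ds.as_numpy_iterator()]
    return first, second


# Toy stand-in for a dataset of images: the integers 0..9.
base = tf.data.Dataset.range(10)

# No shuffle: every pass yields the elements in their original order.
plain_first, plain_second = two_passes(base)

# Seeded shuffle with the default reshuffle_each_iteration=True:
# a new order is drawn on each pass, so the two lists will usually
# differ even though a seed is set.
reshuffled = base.shuffle(buffer_size=10, seed=1)
resh_first, resh_second = two_passes(reshuffled)

# Seeded shuffle with reshuffle_each_iteration=False:
# the order is shuffled, but identical on every pass.
fixed = base.shuffle(buffer_size=10, seed=1, reshuffle_each_iteration=False)
fixed_first, fixed_second = two_passes(fixed)

print("no shuffle:        ", plain_first, plain_second)
print("seeded, reshuffled:", resh_first, resh_second)
print("seeded, fixed:     ", fixed_first, fixed_second)
```

For evaluation data, where labels have to line up with predictions, shuffle=False as shown in the answer is the simpler choice; the reshuffling behaviour is mainly useful for training data.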