Tags: python, tensorflow, tensorflow-datasets

Quick Question: shuffling data in practice with shuffle_files in tfds.load


When passing shuffle_files=True to tfds.load in the latest version of TF, suppose a dataset like ImageNet (split into 1024 different files, I think) is loaded like this:

tfds.load(name = 'imagenet', shuffle_files = True)

This will shuffle the order of the files, but not the actual images within each of the 1024 files. Is there any reason this is done in practice? Is it the same reason why you'd usually shuffle a set of 100 images before feeding it into a NN?

Thank you!


Solution

  • I think you're talking about 'imagenet2012', so your code should be:

    ds = tfds.load('imagenet2012', split='train', shuffle_files=True)
    

    If you do mean ImageNet, see this page: load imagenet

    Here, the shuffle_files argument shuffles the order of the input files as they are loaded, not the examples inside each file. You should therefore also shuffle the dataset itself (for example with the dataset's shuffle method) so that individual examples get mixed. This tutorial explains how dataset shuffling works: shuffle_repeat_explained. You can also read about how shuffle_files improves performance here: shuffle and training
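
To make the difference between the two levels of shuffling concrete, here is a minimal pure-Python sketch. It is a toy model, not how TFDS or tf.data is actually implemented: the helper names (make_shards, read_shards, shuffle_buffer), the shard count and the buffer size are all made up for illustration. It assumes each "file" is just a list of consecutive example ids, and it approximates example-level shuffling with a fixed-size buffer, which is the general idea behind a buffered dataset shuffle.

```python
import random


def make_shards(num_shards, examples_per_shard):
    """Build toy 'files': shard i holds a run of consecutive example ids."""
    return [
        list(range(i * examples_per_shard, (i + 1) * examples_per_shard))
        for i in range(num_shards)
    ]


def read_shards(shards, shuffle_files, rng):
    """Yield examples shard by shard; only the shard ORDER may be shuffled."""
    order = list(range(len(shards)))
    if shuffle_files:
        rng.shuffle(order)
    for idx in order:
        # Examples inside a shard always come out in their stored order.
        yield from shards[idx]


def shuffle_buffer(stream, buffer_size, rng):
    """Example-level shuffle using a fixed-size buffer (buffer_size >= 1)."""
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


rng = random.Random(0)  # fixed seed so repeated runs give the same order
shards = make_shards(num_shards=4, examples_per_shard=5)

# File-level shuffling only: shards are reordered, contents stay sequential.
files_only = list(read_shards(shards, shuffle_files=True, rng=rng))

# File-level shuffling followed by an example-level shuffle buffer.
both = list(
    shuffle_buffer(read_shards(shards, shuffle_files=True, rng=rng),
                   buffer_size=8, rng=rng)
)

print("shuffle_files only:", files_only)
print("files + buffer:    ", both)
```

With file shuffling alone, every run of five ids from the same shard stays together and in order, so neighbouring examples in a batch still come from the same file. The buffer mixes examples across shard boundaries, but only within a window of roughly buffer_size items, which is why the two are usually combined: shuffling files changes which examples are near each other in the stream, and the buffer then mixes them locally.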