Is there an optimal number of elements for a tfrecords file?

This is follow up to these SO questions

What is the need to do sharding of TFRecords files?

and this passage from this tutorial

For this small dataset we will just create one TFRecords file for the training-set and another for the test-set. But if your dataset is very large then you can split it into several TFRecords files called shards. This will also improve the random shuffling, because the Dataset API only shuffles from a smaller buffer of e.g. 1024 elements loaded into RAM. So if you have e.g. 100 TFRecords files, then the randomization will be much better than for a single TFRecords file.

https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/18_TFRecords_Dataset_API.ipynb

So there is an optimal file size, but I am wondering, if there's an optimal number of elements? Since it's the elements itself that's being distributed to the GPUs cores?

Solution

Are you trying to optimize:

1 initial data randomization? 2 data randomization across training batches and/or epochs? 3 training/validation throughput (ie, gpu utilization)?

Initial data randomization should be handled when data are initially saved into sharded files. This can be challenging, assuming you can't read the data into memory. One approach is to read all the unique data ids into memory, shuffle those, do your train/validate/test split, and then write your actual data to file shards in that randomized order. Now your data are initially shuffled/split/sharded.

Initial data randomization will make it easier to maintain randomization during training. However, I'd still say it is 'best practice' to re-shuffle file names and re-shuffle a data memory buffer as part of the train/validate data streams. Typically, you'll set up an input stream using multiple threads/processes. The first step is to randomize the file input streams by re-shuffling the filenames. This can be done like:

train_files = tf.data.Dataset.list_files('{}/d*.tfr'.format(train_dir),
                                            shuffle=True)

Now, if your initial data write was already randomized, you 'could' read the entire data from one file, before going to the next, but that would still impact re-randomization throughout the training process, so typically you interleave file reads, reading a certain number of records from each file. This also improves throughput, assuming you are using multiple file read processes (which you should do, to maximize gpu throughput).

blocksize  = 1000     # samples read from one file before switching files
train_data  = train_files.interleave(interleaveFiles,
              block_length=blocksize,
              num_parallel_calls=tf.data.experimental.AUTOTUNE)

Here, we're reading 1000 samples from each file, before going on to the next. Again, to re-shuffle the training data each epoch (which may or may not be critical), we re-shuffle the data in memory, setting a memory buffer based on what's available on the machine and how large our data items are (note - before formatting the data for gpu).

buffersize = 1000000  # samples read before shuffling in memory
train_data  = train_data.shuffle(buffersize,
                                 reshuffle_each_iteration=True)
train_data  = train_data.repeat()

The repeat() call is just to allow the data set to 'wrap around' during training. This may or may not be important, depending on how you set up your training process.

To optimize throughput, you can do 2 things:

1 alter the order of operations in the data input stream. Typically, if you put your randomization operations early, they can operate on 'low weight' entities, like file names, rather than on tensors. 2 use pre-fetching to let your cpu processes stream data during gpu calculations

train_data  = train_data.map(mapData,
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_data  = train_data.padded_batch(batchsize)
train_data  = train_data.prefetch(10)

So, mapping and batching happens last (this is usually preferred for maximizing gpu throughput, but it can depend on other factors, like data size (pre and post-tensorizing), and how computationally expensive your map function is).

Finally, you can tune the prefetch size to maximize gpu throughput, constrained by system memory and memory speed.

So, how does this all impact the 'optimal' number of data items in each sharded file?

Obviously, if your data/file size is > your blocksize, blocksize becomes irrelevant, and you might as well read each file completely. Typically, if you are going to use this paradigm, you wand blocksize << data/file. I use 10x; so if my blocksize is 1000, I have ~10,000 data items in the file. This may not be optimal, but so far I can maintain >90% gpu usage using this approach on my specific hardware. If you want to tune for your hardware, you could start somewhere at ~10x and adjust, based on whatever you are specifically trying to optimize.

If you have very large numbers of files, you may run into problems maintaining good file read streams, but on a modern system you should be able to get to 100,000 files or more and still be fine. Moving large numbers of files around can be difficult, but usually easier than having very small numbers of very big files, so there are some (broad) constraints on file sizes that can impact how many data items/file you end up with. Generally speaking, I'd say having on the order of 100s of files would be ideal for a large dataset. That way you can easily stream files across a network efficiently (again, that will depend on your network). If the data set is small, you'll have 10s to 50s of files, which is fine for streaming, depending on file size (I typically try to hit 100-300MB/file, which works well for moving things around a LAN or WAN).

So, I think file-size and number-of-files places much stronger constraints on your process than number of data items/file, so long as you have an appropriate number of data items/file, given your file read blocksize. Again, you could hyper-shard your files (1 data item/file?), and read entire files into memory, without using file blocking. That might work, and it would certainly be lightweight to shuffle file names, rather than data items. But you might also end up with millions of files!

To really optimize, you'll need to set up an end-to-end training system on a particular machine, and then tweak it to see what works best for your particular data, network, and hardware. So long as your data are effectively randomized and your data files are easy to store/use/share, you just want to optimize gpu throughput. I would be surprised if reordering the data input stream and pre-fetching doesn't get you there.