I was under the impression that having a pre-computed TFRecord file was the most efficient way to feed data to your input function. However, I keep seeing great-looking articles such as this one, where the input function takes a reference to a raw file on disk and does the decoding on the spot.
The way I've done this in the past is with a separate script that, given references to some files, generates a TFRecord file with the data augmentation baked in. For example, the first n images in the TFRecord would be a given image followed by random transformations of it, and so on. The input function then simply decoded each record and handled the batching, shuffling, etc.
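To make the question concrete, here is a minimal sketch of that workflow, assuming TF 2.x eager mode; the function names (`write_augmented_tfrecord`, `input_fn`) and the choice of a random left-right flip as the augmentation are illustrative, not from any particular article:

```python
import tensorflow as tf

def write_augmented_tfrecord(images, path, n_augment=2):
    """Write each image plus `n_augment` random flips of it to `path`."""
    with tf.io.TFRecordWriter(path) as writer:
        for image in images:
            variants = [image] + [
                tf.image.random_flip_left_right(image) for _ in range(n_augment)
            ]
            for v in variants:
                feature = {
                    "image": tf.train.Feature(
                        bytes_list=tf.train.BytesList(
                            value=[tf.io.serialize_tensor(v).numpy()]))
                }
                example = tf.train.Example(
                    features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())

def input_fn(path, batch_size=4):
    """Decode each record back into a tensor, then shuffle and batch."""
    def parse(record):
        parsed = tf.io.parse_single_example(
            record, {"image": tf.io.FixedLenFeature([], tf.string)})
        return tf.io.parse_tensor(parsed["image"], out_type=tf.uint8)
    return (tf.data.TFRecordDataset(path)
            .map(parse)
            .shuffle(buffer_size=100)
            .batch(batch_size))
```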
You probably have this impression because this input format is put forward on the TensorFlow website, where it is designated as the "recommended format", or even the "standard TensorFlow format".
In my opinion, the main benefit of the TFRecord format is simply its first-class support within TensorFlow.
However, the format itself, based on protobuf, was not designed with performance first. For example, labels are stored in plain text and repeated for each record; consequently, TFRecord files may end up much larger than plain-text CSV files. The way numerical values are stored is not designed for performance either: the number of bytes used to encode a value does not necessarily match the input type (e.g. a uint8 may be stored using one or two bytes depending on its value); worse, negative integer values are stored using 10 (!) bytes no matter what.
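You can see this overhead concretely with a small pure-Python helper (not part of TensorFlow; a sketch of protobuf's varint rule for int64 fields) that computes how many bytes a given integer occupies on the wire:

```python
def varint_size(n):
    """Number of bytes protobuf needs to encode `n` as an int64 varint.

    Negative values are first mapped onto the full unsigned 64-bit range
    (two's complement), which is why they always take the maximum 10 bytes.
    """
    if n < 0:
        n += 1 << 64  # reinterpret as unsigned 64-bit
    size = 1
    while n >= 0x80:  # each varint byte carries 7 payload bits
        n >>= 7
        size += 1
    return size

print(varint_size(5))    # small uint8 value: 1 byte
print(varint_size(200))  # uint8 value above 127: 2 bytes
print(varint_size(-1))   # any negative integer: 10 bytes
```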
In my experience, TFRecord files have never provided a performance boost to my input pipeline: at best they have been on par with raw data, and most of the time they resulted in slightly worse performance. On the other hand, the format is largely unknown outside of TensorFlow, and even within TensorFlow you need to scratch your head a bit just to read a single record for debugging.
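To illustrate that last point, here is a hand-rolled reader, a debugging sketch rather than something you would use in production, that walks the documented on-disk framing of a TFRecord file (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload) while skipping CRC verification:

```python
import struct

def read_tfrecord_frames(path):
    """Yield the raw payload bytes of each record in a TFRecord file.

    Framing per record: uint64 length, uint32 CRC of the length,
    `length` payload bytes, uint32 CRC of the payload. For quick
    debugging we ignore both CRCs and just walk the frames.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return
            (length,) = struct.unpack("<Q", header)
            f.read(4)                 # length CRC (ignored)
            payload = f.read(length)  # a serialized tf.train.Example
            f.read(4)                 # payload CRC (ignored)
            yield payload
```

Note that each payload is still a serialized `tf.train.Example` protobuf, which has to be parsed in turn before you can inspect the features, which is precisely the head-scratching mentioned above.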
So unless you strive for portability, you can work on raw binary data without fear of missing much. If your files are very small, however, consider grouping several samples into a single file for performance, or using something more elaborate such as HDF5. (If portability is a concern, I would still benchmark against HDF5, which is also portable.)
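One simple way to group many small same-shape samples into a single file, without any special format, is to stack them into one NumPy array and memory-map it at read time; this is a sketch of the idea, not a recommendation over HDF5:

```python
import numpy as np

def pack_samples(samples, path):
    """Stack many small, same-shape samples into one .npy file."""
    np.save(path, np.stack(samples))

def load_sample(path, index):
    """Memory-map the packed file and read one sample lazily.

    mmap_mode="r" avoids loading the whole file into memory, so random
    access to individual samples stays cheap even for large packs.
    """
    packed = np.load(path, mmap_mode="r")
    return np.asarray(packed[index])
```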
Lastly, do not take my word for it: benchmark the formats on your own problem. One advantage of TFRecord being put forward by the dev team is that you will find many examples of how to use it, starting with converting data to this format.