
Should a TFRecord contain multiple observations or one?


In an explanation I read, a single TFRecord file contains multiple classes and multiple images (a cat and a bridge). Both images are written into the same TFRecord file, and reading it back confirms that the file contains two images.

Elsewhere I have seen people generate one TFRecord file per image. I know you can load multiple TFRecord files like this:

train_dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("<Path>/*.tfrecord"))

But which way is recommended? Should I build one TFRecord file per image, or one TFRecord file for multiple images? If I put multiple images into one TFRecord file, what is the maximum number?


Solution

  • As you said, a single TFRecord file can hold an arbitrary number of entries, and you can create as many TFRecord files as you like.

    I would recommend using practical considerations to decide how to proceed:

    • On one hand, prefer fewer TFRecord files, since they are easier to handle and move around the filesystem
    • On the other hand, avoid letting a TFRecord file grow so large that it becomes a problem for the filesystem
    • Keep in mind that it is useful to keep separate TFRecord files for the train / validation / test splits
    • Sometimes the nature of the dataset makes it obvious how to split it into separate files (for example, I have a video dataset where I use one TFRecord file per participant session)
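
As a rough illustration of the points above, here is a minimal sketch that writes many observations into a small number of TFRecord files per split and reads them back as one dataset. It assumes TensorFlow 2.x with eager execution. The feature names ("image", "label"), the helper names (make_example, write_shards, load_split), the file naming pattern, and the shard count are all hypothetical choices for this example, not something prescribed by TensorFlow.

```python
import os
import tempfile

import tensorflow as tf


def make_example(image_bytes: bytes, label: int) -> tf.train.Example:
    """Pack one observation (hypothetical schema: raw image bytes + int label)."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))


def write_shards(observations, out_dir, split, num_shards):
    """Distribute observations round-robin over num_shards TFRecord files."""
    paths = [
        os.path.join(out_dir, f"{split}-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(p) for p in paths]
    try:
        for i, (image_bytes, label) in enumerate(observations):
            serialized = make_example(image_bytes, label).SerializeToString()
            writers[i % num_shards].write(serialized)
    finally:
        for writer in writers:
            writer.close()
    return paths


def load_split(out_dir, split):
    """Read every file of one split back as a single dataset of (image, label)."""
    # TFRecordDataset takes a list of file names, so expand the pattern first.
    files = sorted(tf.io.gfile.glob(os.path.join(out_dir, f"{split}-*.tfrecord")))
    schema = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, schema)
        return parsed["image"], parsed["label"]

    return tf.data.TFRecordDataset(files).map(parse)


if __name__ == "__main__":
    # Fake data standing in for encoded images: (bytes, label) pairs.
    fake = [(bytes([i]) * 16, i % 2) for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_shards(fake, tmp, "train", num_shards=3)
        count = sum(1 for _ in load_split(tmp, "train"))
        print(len(paths), "files,", count, "records")
```

Using a separate prefix per split (for example "train", "validation", "test") keeps the splits in separate files, and num_shards can be tuned so that each file stays at a size your filesystem handles comfortably.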