python · tensorflow · object-detection

Can TensorFlow shuffle multiple sharded TFRecord binaries for object detection training?


I am trying to train a Faster R-CNN model with the Object Detection API.

I have a dataset of 5 classes (truck, car, van, boat, and bike), with about 1000 images each. Each class has its own TFRecord file, sharded into 10 pieces, for a total of 50 files that look something like this:

  • truck_train.record-00000-of-00010
  • car_train.record-00000-of-00010
  • van_train.record-00000-of-00010
  • boat_train.record-00000-of-00010
  • bike_train.record-00000-of-00010

Can I configure my training pipeline so that TensorFlow opens and shuffles the contents of these files randomly?

I am aware that I could simply regenerate the TFRecord files from scratch and mix my data that way, but my intent here is to be able to add classes to my dataset simply by adding the TFRecord files containing a new class.

Having read this older answer on shuffling, I wonder whether there is a built-in way for TensorFlow to implement a shuffle queue, even if it means splitting my TFRecord files into 100 shards instead of 10.
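
For context, the kind of behaviour I am hoping for is roughly what the tf.data API does when you shuffle the shard filenames and interleave records from several shards at once (a minimal sketch; the glob pattern and buffer sizes are illustrative, not from my actual pipeline):

import tensorflow as tf

# Shuffle the shard filenames, read several shards concurrently so that
# classes get mixed, then shuffle individual records in a buffer.
filenames = tf.data.Dataset.list_files("Datasets/*_train.record-*",
                                       shuffle=True)
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=10,  # read from 10 shards at once, mixing classes
    block_length=1)   # take one record from each shard in turn
dataset = dataset.shuffle(buffer_size=2048)  # record-level shuffle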

I am using a modified sample .config file for Faster R-CNN, but I foresee issues if TensorFlow opens only one .record file at a time, since each file contains only a single class.

I am aware that the tf_record_input_reader can receive a list of files:

train_input_reader: {
  tf_record_input_reader {
    input_path: ["Datasets\train-1.record", "Datasets\train-2.record"]
  }
}

By increasing the shuffle buffer size and the num_readers of the input reader, will train.py randomize the data sufficiently?


Solution

  • A config like this should be fine:

    train_input_reader: {
      tf_record_input_reader {
        input_path: "Datasets\train-1.record"
        input_path: "Datasets\train-2.record"
        ...
        input_path: "Datasets\train-10.record"
      }
      shuffle: true
    }
    

    Or simply:

    train_input_reader: {
      tf_record_input_reader {
        input_path: "Datasets\*.record"
      }
      shuffle: true
    }
    

    However, the default value for shuffle is already true, so setting it explicitly is just for clarity.

    The default value for num_readers is 64 and for filenames_shuffle_buffer_size is 100, so the 50 files you have should be well covered.
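
    If you do split into many more shards, these knobs can also be set explicitly in the input reader; the field names below come from the API's input_reader.proto, and the values are illustrative:

    train_input_reader: {
      tf_record_input_reader {
        input_path: "Datasets\*.record"
      }
      shuffle: true
      num_readers: 64                     # shards read in parallel
      filenames_shuffle_buffer_size: 100  # keep >= the number of shard files
      shuffle_buffer_size: 2048           # record-level shuffle buffer
    }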