TensorFlow Object Detection API augmentations

I'm curious about the order of resizing and augmentations in the TensorFlow object detection API. For example, I'm using the config file ssd_mobilenet_v2_oid_v4.config. This uses fixed_shape_resizer and ssd_random_crop. So what is the interaction between these two modules?

Does the ssd_random_crop take a crop of the size defined in fixed_shape_resizer? If resizing happens first, then what size are the crops after resizing? And I assume they all need to be the same exact size in order to create proper batches?


  • Data augmentation happens before resizing. All preprocessing steps are specified in the function transform_input_data. The file that defines it also contains functions such as create_train_input_fn, create_eval_input_fn and create_predict_input_fn, which feed input image tensors to the model during training, evaluation and prediction. create_train_input_fn uses the following transform function:

    def transform_input_data(tensor_dict,
                             ...):
      """A single function that is responsible for all input data transformations.

      Data transformation functions are applied in the following order.
      1. If key fields.InputDataFields.image_additional_channels is present in
         tensor_dict, the additional channels will be merged into ...
      2. data_augmentation_fn (optional): applied on tensor_dict.
      3. model_preprocess_fn: applied only on image tensor in tensor_dict.
      4. image_resizer_fn: applied on original image and instance mask tensor in
         ...
      5. one_hot_encoding: applied to classes tensor in tensor_dict.
      6. merge_multiple_boxes (optional): when groundtruth boxes are exactly the
         same they can be merged into a single box with an associated k-hot class
         ...

      Args:
        tensor_dict: dictionary containing input tensors keyed by ...
        model_preprocess_fn: model's preprocess function to apply on image tensor.
          This function must take in a 4-D float tensor and return a 4-D preprocess
          float tensor and a tensor containing the true image shape.
        image_resizer_fn: image resizer function to apply on groundtruth instance
          masks. This function must take a 3-D float tensor of an image and a 3-D
          tensor of instance masks and return a resized version of these along with
          the true shapes.
        num_classes: number of max classes to one-hot (or k-hot) encode the class
          ...
        data_augmentation_fn: (optional) data augmentation function to apply on
          input `tensor_dict`.
        merge_multiple_boxes: (optional) whether to merge multiple groundtruth boxes
          and classes for a given image if the boxes are exactly the same.
        retain_original_image: (optional) whether to retain original image in the
          output dictionary.
        use_multiclass_scores: whether to use multiclass scores as
          class targets instead of one-hot encoding of `groundtruth_classes`.
        use_bfloat16: (optional) a bool, whether to use bfloat16 in training.

      Returns:
        A dictionary keyed by fields.InputDataFields containing the tensors obtained
        after applying all the transformations.
      """

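    Data augmentation, if any is configured, is performed in step 2, and resizing is performed in step 4. So ssd_random_crop first produces crops of varying size, and fixed_shape_resizer then resizes every crop to the height and width set in the config, which is why all images in a batch end up the same size.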
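
To make the ordering concrete, here is a minimal pure-Python sketch of "crop first, then resize to a fixed shape". It is not the Object Detection API's code: the helpers random_crop, fixed_shape_resize and transform are hypothetical stand-ins, images are plain nested lists, the resize is nearest-neighbour, and the crop ignores the constraints the real ssd_random_crop applies. It only illustrates why variable-size crops still give uniform batches.

```python
import random


def random_crop(image, rng):
    """Hypothetical stand-in for an augmentation: crop a random sub-rectangle."""
    height, width = len(image), len(image[0])
    crop_h = rng.randint(1, height)        # crop size varies per image
    crop_w = rng.randint(1, width)
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def fixed_shape_resize(image, out_h, out_w):
    """Hypothetical stand-in for a fixed-shape resizer (nearest neighbour)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def transform(image, rng, out_h=4, out_w=4):
    """Apply augmentation first (step 2), then the resizer (step 4)."""
    augmented = random_crop(image, rng)
    return fixed_shape_resize(augmented, out_h, out_w)


rng = random.Random(0)
# Three toy "images" with different original sizes (height, width).
images = [
    [[r * w + c for c in range(w)] for r in range(h)]
    for h, w in [(6, 8), (10, 5), (7, 7)]
]
batch = [transform(img, rng) for img in images]
shapes = {(len(img), len(img[0])) for img in batch}
print(shapes)  # {(4, 4)}: every image has the resizer's output shape
```

Whatever size the crop happens to be, the resizer runs afterwards and fixes the final shape, so the crop size never has to match the size configured in the resizer.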