Tags: python, tensorflow, tensorrt

What's the use for converter.build() in TensorRT?


The official documentation on TensorRT lists two ways to convert a TensorFlow SavedModel into a TensorRT SavedModel: the first is

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the SavedModel; TensorRT engines are built lazily at inference time.
converter = trt.TrtGraphConverterV2(input_saved_model_dir=input_saved_model_dir)
converter.convert()
converter.save(output_saved_model_dir)

and the second is

import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.framework import convert_to_constants
from tensorflow.python.saved_model import signature_constants, tag_constants

conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS
conversion_params = conversion_params._replace(
    max_workspace_size_bytes=(1 << 32))
conversion_params = conversion_params._replace(precision_mode="FP16")
conversion_params = conversion_params._replace(
    maximum_cached_engines=100)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)
converter.convert()

num_runs = 1  # how many batches input_fn yields

def my_input_fn():
  # Yield representative inputs; build() creates one engine per distinct shape.
  for _ in range(num_runs):
    inp1 = np.random.normal(size=(8, 16, 16, 3)).astype(np.float32)
    inp2 = np.random.normal(size=(8, 16, 16, 3)).astype(np.float32)
    yield inp1, inp2

converter.build(input_fn=my_input_fn)  # build TensorRT engines ahead of time
converter.save(output_saved_model_dir)

saved_model_loaded = tf.saved_model.load(
    output_saved_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_func = convert_to_constants.convert_variables_to_constants_v2(
    graph_func)
# input_data: tensor(s) matching the model's input signature (placeholder here)
output = frozen_func(input_data)[0].numpy()

Stripping out the boilerplate for imports, inference, etc., the difference seems to lie in the call to converter.build(). The documentation explains this function as follows:

"This method optimizes the converted function (returned by convert()) by building TensorRT engines. This is useful in case the user wants to perform the optimizations before runtime. The optimization is done by running inference on the converted function using the input data received from the argument input_fn. This argument is a generator function that yields input data as a list or tuple. "

What does "before runtime" mean in this context? Will the "optimizations" be performed upon model loading, upon the first inference, or upon every single inference using the converted model? What are those optimizations, even? Isn't converting the model to TensorRT an optimization in itself?

I am asking because if I call converter.build(), the conversion seems to fail in unpredictable ways after running for a LOT of time (more than two hours) without producing any sensible output, so I was wondering how much I am losing by not calling it, and whether there is more comprehensive documentation on using TF 2.x SavedModels with TensorRT.

Thanks in advance to whoever can answer!!


Solution

  • From my understanding (after reading TensorFlow's docs), converter.convert() converts the graph to TF-TRT, replacing whatever subgraphs it can with TRTEngineOp nodes, but it doesn't create the actual engine files used during inference.

    The call to converter.build(), however, does create the engine files, but only for the input shapes yielded by input_fn, and for the platform that build() is run on. So, reasons for not calling converter.build() would be that you don't know the input shapes beforehand, or that you can't run the build on the platform you're going to run inference on.
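
    As a quick sanity check on what convert() actually changed, you can count the TRTEngineOp nodes in the converted graph. This is only a sketch, assuming graph_func is the signature loaded from the converted SavedModel as in the question's snippet; depending on the TF version, the TRT nodes may sit in the graph's function library rather than the top-level graph:

    gd = graph_func.graph.as_graph_def()
    all_nodes = list(gd.node) + [n for f in gd.library.function
                                 for n in f.node_def]
    num_trt = sum(1 for n in all_nodes if n.op == "TRTEngineOp")
    print(f"{num_trt} TRTEngineOp node(s) in the converted model")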

    I find it hard to imagine that new engine files are created for each forward pass, but one definitely is for each new input shape. It's unclear from the examples whether input_fn is used in any way beyond providing input-shape information, but if you return inputs of different shapes, one engine file should be created for each input size.

    As an example, providing the following input function would produce one engine for input size (112,112,3) and one for (224,224,3):

    def input_fn():
      # One engine is built per distinct input shape encountered here.
      input_sizes = [[112, 112], [224, 224]]
      for size in input_sizes:
        inp1 = np.random.normal(size=(1, *size, 3)).astype(np.float32)
        yield [inp1]  # yield a list; see the note on tuples below
    

    As for your input_fn, do you have two images as input to your network? What worked for me was returning a single image in a list like in the above sample (tuple didn't work for some reason, even though the docs say it should).
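
    One way to see what build() actually buys you is to time the first inference against a later one for the same input shape. If you skip build(), the first call for each new shape constructs the engine on the fly, so it should be dramatically slower than the calls after it. A rough sketch, assuming frozen_func from the question's snippet and a single-input model that takes a (1, 112, 112, 3) tensor:

    import time
    import numpy as np
    import tensorflow as tf

    inp = tf.constant(
        np.random.normal(size=(1, 112, 112, 3)).astype(np.float32))

    start = time.time()
    frozen_func(inp)   # may include engine construction if build() was skipped
    first = time.time() - start

    start = time.time()
    frozen_func(inp)   # the engine for this shape is cached now
    second = time.time() - start

    print(f"first call: {first:.2f}s, subsequent call: {second:.2f}s")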

    Hope this helped.