
Why is there more output data when using INT8 inference with TensorRT?


I implemented INT8 engine inference using TensorRT.

The training batch size is 50 and the inference batch size is 1.

But when I run inference with

[outputs] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)

the output size is 13680000.

It should be 273600 (60 × 80 × 57). Using FP32/FP16 produced an output size of 273600.

Why is the output 50 times larger when using INT8?

My inference code is:

with engine.create_execution_context() as context:
    fps_time = time.time()
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    im = np.array(frm, dtype=np.float32, order='C')
    #im = im[:,:,::-1]
    inputs[0].host = im.flatten()
    [outputs] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)
    outputs = outputs.reshape((60, 80, 57))

Solution

  • It is because the training batch size is 50, and the output buffer is allocated for that batch size, so the flat output holds 50 × 273600 = 13680000 values.

    Reshape the output as outputs = outputs.reshape((50, 60, 80, 57)).

    Then take the [0] tensor; that is the result when inference is done with a single image. See the sketch after this list.
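
    For illustration, here is a minimal sketch of the corrected post-processing, assuming the same common.allocate_buffers/common.do_inference helpers from the TensorRT samples, and keeping the engine, frm, and 60 × 80 × 57 output shape from the original snippet:

        import time
        import numpy as np

        with engine.create_execution_context() as context:
            fps_time = time.time()
            inputs, outputs, bindings, stream = common.allocate_buffers(engine)
            im = np.array(frm, dtype=np.float32, order='C')
            inputs[0].host = im.flatten()
            [out] = common.do_inference(context, bindings=bindings, inputs=inputs,
                                        outputs=outputs, stream=stream, batch_size=1)
            # The buffers were allocated for the engine's max batch size (50),
            # so the flat output holds 50 * 60 * 80 * 57 = 13,680,000 values.
            out = out.reshape((50, 60, 80, 57))
            # Only one image was fed in, so the first slice is the actual result.
            result = out[0]  # shape (60, 80, 57), i.e. 273,600 values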