I implemented INT8 engine inference using TensorRT. The training batch size is 50 and the inference batch size is 1.
But at inference time,
[outputs] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)
returns an output of size 13680000. It should be 273600 (60 × 80 × 57); with FP32/FP16 the output size was 273600.
Why is the output 50 times larger (13680000 = 50 × 273600) when using INT8?
My inference code is:

import time
import numpy as np
import common  # buffer/stream helper module shipped with the TensorRT Python samples

with engine.create_execution_context() as context:
    fps_time = time.time()
    # Buffers are allocated according to the engine's bindings and max batch size.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    im = np.array(frm, dtype=np.float32, order='C')
    #im = im[:,:,::-1]  # optional channel reversal (BGR -> RGB), currently disabled
    inputs[0].host = im.flatten()
    [outputs] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)
    outputs = outputs.reshape((60, 80, 57))
It is because the engine's max batch size is the training batch size of 50, so the output buffer is allocated for that batch size.
You need to reshape as outputs = outputs.reshape((50, 60, 80, 57))
and then take the [0] tensor; that is the result of running inference on a single image.
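A minimal sketch of the fix, assuming the engine was indeed built with a max batch size of 50 so that common.allocate_buffers sizes the host output buffer for 50 images (variable names follow the code above):

# allocate_buffers sizes host/device memory for engine.max_batch_size results,
# even though batch_size=1 is passed to do_inference.
[outputs] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)
outputs = outputs.reshape((50, 60, 80, 57))  # (engine max batch size, H, W, C)
result = outputs[0]                          # prediction for the single input image

Alternatively, rebuilding the INT8 engine with a max batch size of 1 should make allocate_buffers return an output of size 273600 directly, so no batch-dimension reshape is needed.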