I used TensorRT to convert a TensorFlow model into TensorRT engines in FP16 and FP32 modes.
I tested with 10 images, and FP16 is not even two times faster than FP32 mode; I expected at least a two-fold speedup. These are the Titan RTX specs, using the Turing architecture.
Using Titan RTX
FP16
msec: 0.171075
msec: 0.134830
msec: 0.129984
msec: 0.128638
msec: 0.118196
msec: 0.123429
msec: 0.134329
msec: 0.119506
msec: 0.117615
msec: 0.127687
FP32
msec: 0.199235
msec: 0.180985
msec: 0.153394
msec: 0.148267
msec: 0.151481
msec: 0.169578
msec: 0.159987
msec: 0.173443
msec: 0.159301
msec: 0.155503
EDIT_1: Following the reply from @y.selivonchyk, I also tested on a Tesla T4, but FP16 is not faster than FP32.
Using Tesla T4
FP16
msec: 0.169800
msec: 0.136175
msec: 0.127025
msec: 0.130406
msec: 0.129874
msec: 0.122248
msec: 0.128244
msec: 0.126983
msec: 0.131111
msec: 0.138897
FP32
msec: 0.168589
msec: 0.130539
msec: 0.122617
msec: 0.120955
msec: 0.128452
msec: 0.122426
msec: 0.125560
msec: 0.130016
msec: 0.126965
msec: 0.121818
Is this result acceptable, or what else do I need to look into?
In this document, on page 15, there is a 5x images/sec difference between FP32 and FP16.
My code for engine serialization from the UFF model and for inference is shown below.
def serializeandsave_engine(model_file):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_batch_size = 1  # max_batch_size
        builder.max_workspace_size = 1 << 30
        builder.fp16_mode = True
        builder.strict_type_constraints = True
        # Parse the UFF network.
        parser.register_input("image", (3, height, width))  # UffInputOrder.NCHW
        parser.register_output("Openpose/concat_stage7")  # check input/output names against the TF model
        parser.parse(model_file, network)
        # Build and save the engine.
        engine = builder.build_cuda_engine(network)
        serialized_engine = engine.serialize()
        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)
    return
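For reference, the engine is loaded back roughly like this before running inference (a minimal sketch, not my exact loading code, assuming the standard trt.Runtime deserialization API and the same TRT_LOGGER and engine_path as above):

def load_engine(engine_path):
    # Deserialize the engine written by serializeandsave_engine().
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    # One execution context is needed to run inference on this engine.
    context = engine.create_execution_context()
    return engine, context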
def infer(engine, x, batch_size, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    # img = np.array(x).ravel()
    np.copyto(inputs[0].host, x.flatten())  # 1.0 - img / 255.0
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return the host-side output buffers.
    return [out.host for out in outputs]
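The per-image timings above were printed around calls to infer(). A hypothetical harness along these lines (the warm-up call and the benchmark/images names are illustrative additions, not my exact timing code) shows the intent; a fair comparison would run one call beforehand, since the first inference carries one-off initialization cost:

import time

def benchmark(engine, context, images, batch_size=1):
    # Warm-up: the first inference carries one-off CUDA/initialization overhead,
    # so it is excluded from the reported numbers.
    infer(engine, images[0], batch_size, context)
    for img in images:
        start = time.perf_counter()
        infer(engine, img, batch_size, context)
        print("msec: %f" % ((time.perf_counter() - start) * 1000.0))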
The Titan series of graphics cards has always been just a beefed-up version of the consumer cards, with a higher number of cores. Titans never had dedicated FP16 cores that would let them run faster with half-precision training (luckily, unlike the 1080s, they do not run slower with FP16).
This assumption is confirmed by the following two reviews: pugetsystems and tomshardware, where the Titan RTX shows a moderate improvement of about 20% when using half-precision floats.
In short, FP16 is only faster when dedicated hardware modules are present on the chip, which is generally not the case for the Titan lineup. Still, FP16 lets you decrease memory consumption during training and run even larger models.
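A quick way to check whether a given GPU actually has a fast FP16 path before building the engine is the builder's capability flag. A small sketch, assuming the same TRT_LOGGER as in the question:

import tensorrt as trt

with trt.Builder(TRT_LOGGER) as builder:
    # True only when the GPU exposes hardware fast-FP16 paths; otherwise
    # fp16_mode mainly saves memory rather than time.
    if builder.platform_has_fast_fp16:
        print("Fast FP16 is supported on this GPU")
    else:
        print("No dedicated fast-FP16 path; expect little or no speedup")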