Tags: tensorflow, tensorrt

FP16 not even two times faster than using FP32 in TensorRT


I used TensorRT to convert a TensorFlow model into TensorRT engines in both FP16 and FP32 modes.

I tested with 10 images, and FP16 is not even two times faster than FP32 mode; I expected it to be at least two times faster. Here are the Titan RTX specs; it uses the Turing architecture.

Using Titan RTX
    FP16
    msec: 0.171075
    msec: 0.134830
    msec: 0.129984
    msec: 0.128638
    msec: 0.118196
    msec: 0.123429
    msec: 0.134329
    msec: 0.119506
    msec: 0.117615
    msec: 0.127687


    FP32
    msec: 0.199235
    msec: 0.180985
    msec: 0.153394
    msec: 0.148267
    msec: 0.151481
    msec: 0.169578
    msec: 0.159987
    msec: 0.173443
    msec: 0.159301
    msec: 0.155503

EDIT_1: According to the reply from @y.selivonchyk, I also tested on a Tesla T4, but FP16 is still not faster than FP32.

Using Tesla T4
FP16
msec: 0.169800
msec: 0.136175
msec: 0.127025
msec: 0.130406
msec: 0.129874
msec: 0.122248
msec: 0.128244
msec: 0.126983
msec: 0.131111
msec: 0.138897

FP32
msec: 0.168589
msec: 0.130539
msec: 0.122617
msec: 0.120955
msec: 0.128452
msec: 0.122426
msec: 0.125560
msec: 0.130016
msec: 0.126965
msec: 0.121818

Is that result acceptable? Or what else do I need to look into?

In this document, on page 15, there is a 5x images/sec difference between FP32 and FP16.

My code for engine serialization from the UFF model and for inference is shown below.

import tensorrt as trt

# TRT_LOGGER, engine_path, height and width are defined at module level.
def serializeandsave_engine(model_file):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_batch_size = 1
        builder.max_workspace_size = 1 << 30  # 1 GiB workspace
        builder.fp16_mode = True
        builder.strict_type_constraints = True
        # Parse the UFF network (input order is NCHW); check the input/output
        # names against the TensorFlow model.
        parser.register_input("image", (3, height, width))
        parser.register_output("Openpose/concat_stage7")
        parser.parse(model_file, network)
        # Build the engine and serialize it once.
        engine = builder.build_cuda_engine(network)
        serialized_engine = engine.serialize()
        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)
        return
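For reference, this is a minimal sketch (not part of my original script) of how the saved engine is deserialized before inference, assuming the same TRT_LOGGER and engine_path as above:

with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    # Rebuild the engine from the serialized plan file.
    engine = runtime.deserialize_cuda_engine(f.read())
# Create the execution context that is passed to infer() below.
context = engine.create_execution_context()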

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context

class HostDeviceMem(object):
    """Pair of a pinned host buffer and its device buffer (as in the TensorRT samples)."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def infer(engine, x, batch_size, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host (page-locked) and device buffers.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer address to the bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    # Copy the preprocessed image into the pinned input buffer.
    np.copyto(inputs[0].host, x.flatten())
    # Transfer inputs to the GPU, run inference, and transfer predictions back.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream and return the host output buffers.
    stream.synchronize()
    return [out.host for out in outputs]
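The per-image numbers above come from timing each call to infer. A minimal sketch of such a measurement loop (my own illustration, assuming a list of preprocessed test images named images, and adding a warm-up call that was not in the original code):

import time

# Warm-up call so one-time setup costs are not counted in the measurements.
infer(engine, images[0], 1, context)

for img in images:
    start = time.perf_counter()
    infer(engine, img, 1, context)
    elapsed = time.perf_counter() - start
    # Elapsed time per image, in seconds.
    print("inference time: %.6f" % elapsed)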

Solution

  • The Titan series of graphics cards has always been just a beefed-up version of the consumer cards, with a higher number of cores. Titans never had the dedicated FP16 cores that would let them run faster with half-precision training (luckily, unlike the 1080s, they do not run slower with FP16).

    This assumption is confirmed by the following two reviews, pugetsystems and tomshardware, where the Titan RTX shows a moderate improvement of about 20% when using half-precision floats.

    In short, FP16 is only faster when dedicated hardware modules are present on the chip, which is generally not the case for the Titan line-up. Still, FP16 reduces memory consumption during training and lets you run even larger models.
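    One quick way to check whether the target GPU actually reports fast native FP16 before building the engine is the builder's platform_has_fast_fp16 flag; a minimal sketch, assuming the same TRT_LOGGER as in the question:

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder:
        # Only enable FP16 mode if the GPU reports fast half-precision support.
        if builder.platform_has_fast_fp16:
            builder.fp16_mode = True
        else:
            print("No fast FP16 support on this GPU; building in FP32.")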