Tags: performance, tensorflow, conv1d

Tensorflow conv1d/Keras Conv1D strange performance variation


I am getting somewhat unexpected results when measuring the processing runtime of the Conv1D layer and wonder if anybody understands them. Before going on, I note that the observation is not limited to the Conv1D layer; it can be observed similarly for the tf.nn.conv1d function.

The code I am using is very simple:

import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time

def fun(sigl, cc, bs=10):
    # time one convolution of a batch of bs signals of length sigl with 200 channels
    oo = tf.ones((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    # .numpy() copies the result back to the host, so the GPU work has finished
    # before the clock is stopped
    ss = cc(oo).numpy()
    dur = time.time() - start_time
    print(f"size {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz  su {ss.shape}")


cctf2t = tf.keras.layers.Conv1D(100,10)
for jj in range(2):
    print("====")
    for ii in range(30):
        fun(10000+ii, cctf2t, bs=10)

I was expecting the first call to be slow and the subsequent ones to show approximately similar runtimes. It turns out that the behavior is quite different. Assuming the code above is stored in a script called debug_conv_speed.py, I get the following on an NVIDIA GeForce GTX 1050 Ti:

$> ./debug_conv_speed.py 
====
size 10000 time: 0.901 speed 111.01kHz  su (10, 9991, 100)
size 10001 time: 0.202 speed 554.03kHz  su (10, 9992, 100)
...
size 10029 time: 0.178 speed 563.08kHz  su (10, 10020, 100)
====
size 10000 time: 0.049 speed 2027.46kHz  su (10, 9991, 100)
...
size 10029 time: 0.049 speed 2026.87kHz  su (10, 10020, 100)

where ... indicates approximately the same result. So, as expected, the first call is slow, and then for each input length I get the same speed of about 550 kHz. But for the repetition I am astonished to find all operations running about 4 times faster, at about 2 MHz.

The results are even more pronounced on a GeForce GTX 1080: there, the first time a length is used it runs at about 200 kHz, and for the repetitions I find a speed of 1.8 MHz.

In response to the answer https://stackoverflow.com/a/71184388/3932675, I add a second variant of the code that uses tf.function with experimental_relax_shapes:

import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time
from functools import partial

print(tf.config.list_physical_devices())

class run_fun(object):
    def __init__(self, ll, channels):
        self.op = ll
        self.channels = channels

    @tf.function(input_signature=(tf.TensorSpec(shape=[None, None, None]),),
                 experimental_relax_shapes=True)
    def __call__(self, input):
        # printed only while tracing, so it shows whenever a retrace happens
        print("retracing")
        return self.op(tf.reshape(input, (tf.shape(input)[0], tf.shape(input)[1], self.channels)))


def run_layer(sigl, ll, bs=10):
    oo = tf.random.normal((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    # again, .numpy() synchronizes with the GPU before the clock is stopped
    ss = ll(oo).numpy()
    dur = time.time() - start_time
    print(f"len {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz su {ss.shape}")


ww = tf.ones((10, 200, 100))
ll = partial(tf.nn.conv1d, filters=ww, stride=1, padding="VALID", data_format="NWC")

run_ll = run_fun(ll, 200)
for jj in range(2):
    print(f"=== run {jj+1} ===")
    for ii in range(5):
        run_layer(10000+ii, run_ll)
        # alternatively for eager mode run
        # run_layer(10000+ii, ll)

The result after running on a Google Colab GPU:

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
=== run 1 ===
retracing
len 10000 time: 10.168 speed 9.83kHz su (10, 9991, 100)
len 10001 time: 0.621 speed 161.09kHz su (10, 9992, 100)
len 10002 time: 0.622 speed 160.80kHz su (10, 9993, 100)
len 10003 time: 0.644 speed 155.38kHz su (10, 9994, 100)
len 10004 time: 0.632 speed 158.18kHz su (10, 9995, 100)
=== run 2 ===
len 10000 time: 0.080 speed 1253.34kHz su (10, 9991, 100)
len 10001 time: 0.053 speed 1898.41kHz su (10, 9992, 100)
len 10002 time: 0.052 speed 1917.43kHz su (10, 9993, 100)
len 10003 time: 0.067 speed 1499.43kHz su (10, 9994, 100)
len 10004 time: 0.095 speed 1058.60kHz su (10, 9995, 100)

This shows that, with the given tf.function arguments, retracing does not happen, yet the performance shows the same difference.

Does anybody know how to explain this?


Solution

  • I am pretty sure I have found the explanation in the TensorFlow cuDNN source, and I share the insight here for others (notably those who upvoted the question) who encounter the same problem.

    cuDNN supports a number of convolution kernels that, in the current TensorFlow version 2.9.0, are obtained by means of CudnnSupport::GetConvolveRunners here:

    https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/stream_executor/cuda/cuda_dnn.cc#L4557

    This is then used in the various autotune functions here:

    https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/core/kernels/conv_ops_gpu.cc#L365

    It appears that each time a configuration consisting of data shape, filter shape, and possibly other parameters is encountered, the CUDA driver tests all of the available kernels and retains the most efficient one. This is a very nice optimization for most cases, notably training with constant batch shapes or inference with constant image sizes. For inference on audio signals of arbitrary length (e.g. audio signals at a 48 kHz sample rate covering durations from 1 s to 20 s have nearly 1 million different lengths), however, the CUDA implementation spends most of its time testing all kernel versions. It hardly ever benefits from knowing which kernel is the most efficient one for a given configuration, because the same configuration is rarely encountered a second time.
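
    One way to check whether this per-shape autotuning is really the cause is to switch it off. As far as I know, TensorFlow honors the TF_CUDNN_USE_AUTOTUNE environment variable for this; the following minimal sketch (my assumption, not part of the original code) disables autotuning before TensorFlow initializes the GPU:

import os
# assumption: the installed TensorFlow build reads TF_CUDNN_USE_AUTOTUNE;
# setting it to 0 skips the per-shape kernel benchmarking
os.environ['TF_CUDNN_USE_AUTOTUNE'] = "0"
import tensorflow as tf

cc = tf.keras.layers.Conv1D(100, 10)
for ii in range(5):
    # every new input length should now take roughly the same time,
    # at the price of a possibly slower default kernel choice
    _ = cc(tf.ones((10, 10000 + ii, 200))).numpy()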

    For my use case, I now use overlap-add-based processing with a fixed signal length and have improved inference time by about a factor of 4.
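
    The overlap-add code itself is not part of this post, but the following hypothetical sketch shows the same underlying idea in a simpler form: pad every signal up to the next length from a small fixed set of bucket lengths, so the autotuner only ever sees a handful of shapes and its per-shape measurements are actually reused. The bucket lengths and the pad_to_bucket helper are made up for illustration.

import tensorflow as tf

BUCKETS = [12000, 24000, 48000, 96000]  # hypothetical fixed lengths

def pad_to_bucket(x):
    """Zero-pad a (batch, length, channels) tensor up to the next bucket length."""
    length = int(tf.shape(x)[1])
    bucket = next(b for b in BUCKETS if b >= length)
    return tf.pad(x, [[0, 0], [0, bucket - length], [0, 0]]), length

conv = tf.keras.layers.Conv1D(100, 10)

sig = tf.random.normal((10, 10007, 200))
padded, length = pad_to_bucket(sig)
out = conv(padded)
# with "valid" convolution the first length - 10 + 1 output frames do not
# depend on the trailing zeros, so trim the output back to that region
out = out[:, : length - 10 + 1, :]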