
Results mismatch between convolution algorithms in Tensorflow/CUDA


I'm training a convolutional autoencoder and noticed the warning below. My environment:

TensorFlow: 2.5 (GPU build from pip)
Driver: 460.80
CUDA: 11.2.2
cuDNN: 8.1.1
XLA: Yes
Mixed precision: Yes
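
XLA and mixed precision are switched on in the usual way, roughly like this (illustrative, not the exact training script):

    import tensorflow as tf

    # Illustrative only -- the standard knobs for the two settings listed above
    tf.config.optimizer.set_jit(True)                             # XLA JIT compilation
    tf.keras.mixed_precision.set_global_policy('mixed_float16')  # FP16 compute, FP32 variables

The warning appears near the end of an epoch: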
26/27 [===========================>..] - ETA: 0s - loss: 1.0554 - pre_dense_out_loss: 0.9997 - de_conv1dtranspose_out_loss: 0.5578
2021-06-05 21:28:17.678118: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 0: 95.25 vs 80.8125
2021-06-05 21:28:17.678132: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 1: 95.6875 vs 81
2021-06-05 21:28:17.678136: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 2: 95.4375 vs 82.125
2021-06-05 21:28:17.678139: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 3: 95.3125 vs 80.5625
2021-06-05 21:28:17.678141: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 4: 95.375 vs 81.3125
2021-06-05 21:28:17.678145: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 5: 94.9375 vs 79.8125
2021-06-05 21:28:17.678148: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 6: 95.3125 vs 81
2021-06-05 21:28:17.678151: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 7: 95.625 vs 82
2021-06-05 21:28:17.678153: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 8: 94.75 vs 78.5625
2021-06-05 21:28:17.678156: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 9: 95.25 vs 80.25
2021-06-05 21:28:17.678170: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:545] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%custom-call.20 = (f16[1,5,24,24]{2,1,0,3}, u8[0]{0}) custom-call(f16[3778,1,50,24]{3,2,1,0} %bitcast.237, f16[3778,1,10,24]{3,2,1,0} %arg45.46), window={size=1x5 stride=1x5}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="Conv2DBackpropFilter" op_name="gradient_tape/model/de_conv1dtranspose_2/conv1d_transpose/Conv2DBackpropFilter"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for 1+TC vs 0+TC
2021-06-05 21:28:17.678174: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:192] Device: GeForce RTX 3070
2021-06-05 21:28:17.678177: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:193] Platform: Compute Capability 8.6
2021-06-05 21:28:17.678180: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:194] Driver: 11020 (460.80.0)
2021-06-05 21:28:17.678182: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:195] Runtime: <undefined>
2021-06-05 21:28:17.678185: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:202] cudnn version: 8.1.1

This is a fresh build on Ubuntu 20.04. I didn't see this warning when I was running on an RTX 2060 on Windows before. The input data is fairly large, so an MRE might be difficult. Does anyone know what this warning is about?


Solution

  • This could be the effect of accumulating in a low-precision (e.g. FP16) data type.

    Which data types are you using? And which algorithms?

    From: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html

    1. Mixed Precision Numerical Accuracy

    When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to the other.

    For example, when the computation is performed in FP32 and the output is in FP16, the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (ALGO_0) has lower accuracy compared to the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (ALGO_1). This is because ALGO_0 does not use extra workspace, and is forced to accumulate the intermediate results in FP16, i.e., half precision float, and this reduces the accuracy. The ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32, i.e., full precision float.
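
    To make the accumulation point concrete, here is a small, purely illustrative sketch (plain NumPy plus the Keras policy check; it is not the actual cuDNN kernel, just the same accumulator-dtype effect):

        import numpy as np
        import tensorflow as tf

        # If mixed precision was enabled via the Keras API, this shows the dtypes in play:
        policy = tf.keras.mixed_precision.global_policy()
        print(policy.compute_dtype, policy.variable_dtype)   # 'mixed_float16' -> float16 / float32

        # Why the accumulator dtype matters (mimics ALGO_0 vs ALGO_1 above):
        values = np.random.uniform(0.0, 1.0, size=10_000).astype(np.float16)

        acc16 = np.float16(0.0)
        for v in values:                      # FP16 accumulator, as in ALGO_0
            acc16 = np.float16(acc16 + v)

        acc32 = values.astype(np.float32).sum()   # FP32 accumulator, as in ALGO_1

        print(acc16, acc32)   # the FP16 sum stalls around 2048; the FP32 sum is close to 5000

    If your policy reports float16 compute, the mismatch the algorithm picker logs is the same flavour of discrepancy: an algorithm that accumulates intermediate results in FP16 can drift noticeably from one that accumulates in FP32.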