The trtexec --help output includes:

--warmUp=N    Run for N milliseconds to warmup before measuring performance (default = 200)
However, why is a warmup needed? If the model (and the intermediate buffers needed for the forward pass) are allocated at model load time, then the only initial performance bottleneck should be the host-to-device memory transfers, and the NVIDIA docs indicate that those are hidden by their enqueueing strategy.
Therefore I'm not sure what else could cause an initial performance bottleneck. Any insight into why the warmup is needed would be much appreciated.
TensorRT needs warmup for multiple reasons:

- GPU clock state: an idle GPU sits in a low power state (P8 in the output below) and only ramps its clocks up under sustained load, so the first iterations run at reduced frequency.
- Lazy initialization: CUDA modules and kernels are loaded (and possibly JIT-compiled) on first use.
- Cold caches: instruction and data caches, as well as internal allocator pools, are only populated after the first few passes.

You can run nvidia-smi to see the current power state and mode:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   On  | 00000000:65:00.0  On |                  N/A |
|  0%   47C    P8    36W / 350W |    473MiB / 12288MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
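To make the idea concrete, here is a minimal sketch of the warmup-then-measure pattern that --warmUp implements. This is an illustrative, generic benchmark helper, not trtexec's actual internals; the function and parameter names are my own.

```python
import time

def benchmark(fn, warmup_iters=10, measured_iters=100):
    """Run fn a few times untimed, then return the mean time per call.

    The warmup calls execute fn but discard the timings, so one-time
    costs (clock ramp-up, lazy kernel loading, cold caches) do not
    pollute the reported average. Names here are illustrative.
    """
    for _ in range(warmup_iters):
        fn()  # executed but not measured
    start = time.perf_counter()
    for _ in range(measured_iters):
        fn()
    return (time.perf_counter() - start) / measured_iters  # mean seconds/call
```

Without the warmup loop, the one-time costs above would be averaged into the first measured iterations and inflate the reported latency, which is exactly what the 200 ms default is there to avoid.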