For a TensorRT .trt file, we load it into an engine and create a TensorRT execution context for the engine. Then we use a CUDA stream for inference by calling context->enqueueV2().
Do we need to call cudaCreateStream() after the TensorRT context is created, or only after selecting the GPU device by calling SetDevice()? How does TensorRT associate the CUDA stream with the TensorRT context?
Can we use multiple streams with one TensorRT context?
In a multi-threaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. So, in one thread, do we just need one engine, one context, and one stream, or multiple streams?
Do we need to call cudaCreateStream() after the TensorRT context is created?
By cudaCreateStream(), do you mean cudaStreamCreate()?
You can create them after you've created your engine and runtime.
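For example, here is a minimal sketch of an initialization order that works. The names setup, engineData, and the trivial Logger class are my own placeholders, not part of your code:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <iostream>
#include <vector>

// Trivial logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING) { std::cerr << msg << std::endl; }
    }
};

void setup(const std::vector<char>& engineData)        // engineData = the bytes of the .trt file
{
    cudaSetDevice(0);                                   // pick the GPU first
    Logger logger;
    auto* runtime = nvinfer1::createInferRuntime(logger);
    auto* engine  = runtime->deserializeCudaEngine(engineData.data(), engineData.size());
    auto* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);                          // creating the stream here, after the context, is fine
    // ... run inference with context and stream ...
}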
As a bonus trivia, you don't necessarily have to use CUDA streams at all. I have tried copying my data from host to device, calling enqueueV2(), and then copying the results from device to host without explicitly creating a CUDA stream. It worked fine.
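That stream-less approach looks roughly like the sketch below: synchronous copies plus the default stream (0) passed to enqueueV2(). The buffer and binding names are placeholders I made up:

// Synchronous copies, with the default stream (0) handed to enqueueV2().
cudaMemcpy(deviceInput, hostInput, inputSizeBytes, cudaMemcpyHostToDevice);
context->enqueueV2(bindings, 0, nullptr);   // 0 = default CUDA stream
cudaMemcpy(hostOutput, deviceOutput, outputSizeBytes, cudaMemcpyDeviceToHost);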
How does TensorRT associate the CUDA stream with the TensorRT context?
The association is that you pass the same CUDA stream as an argument to all of the function calls. The following C++ code illustrates this:
void infer(std::vector<void*>& deviceMemory, void* hostInputMemory, size_t hostInputMemorySizeBytes,
           nvinfer1::IExecutionContext& executionContext, cudaStream_t& cudaStream)
{
    // Copy the input to the device asynchronously on the stream that will run inference.
    auto status = cudaMemcpyAsync(deviceMemory.at(0), hostInputMemory, hostInputMemorySizeBytes,
                                  cudaMemcpyHostToDevice, cudaStream);
    if (status != cudaSuccess) { /* ... handle errors ... */ }

    // Enqueue inference on the same stream.
    if (not executionContext.enqueueV2(deviceMemory.data(), cudaStream, nullptr))
    { /* ... handle errors ... */ }

    void* outputHostMemory;        // allocate enough host memory for the output binding(s)
    size_t outputMemorySizeBytes;  // size of the output binding(s)
    // Copy the result back on the same stream; deviceMemory.at(1) is assumed to be the output binding.
    status = cudaMemcpyAsync(outputHostMemory, deviceMemory.at(1), outputMemorySizeBytes,
                             cudaMemcpyDeviceToHost, cudaStream);
    if (status != cudaSuccess) { /* ... handle errors ... */ }

    // Block until everything queued on the stream has finished.
    cudaStreamSynchronize(cudaStream);
}
You can check this repository if you want a full working example in C++. My code above is just an illustration.
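For completeness, a hypothetical call site for the sketch above could look like the following. The engine, device buffers, and host buffer are assumed to be set up elsewhere:

// Hypothetical usage of the infer() sketch above.
auto* executionContext = engine->createExecutionContext();
cudaStream_t stream;
cudaStreamCreate(&stream);
infer(deviceMemory, hostInputMemory, hostInputMemorySizeBytes, *executionContext, stream);
cudaStreamDestroy(stream);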
Can we use multiple streams with one TensorRT context?
If I understood your question correctly, according to this document the answer is no.
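What you can do instead, and what I believe the guide recommends, is create one execution context per stream from the same engine. A rough sketch, assuming engine is already deserialized:

// One execution context per stream, both built from the same engine (sketch only).
auto* contextA = engine->createExecutionContext();
auto* contextB = engine->createExecutionContext();
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
// Always pair contextA with streamA and contextB with streamB in enqueueV2() calls.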
In a multi-threaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. So, in one thread, do we just need one engine, one context, and one stream, or multiple streams?
one model might be loaded in more than one thread
This doesn't sound right.
An engine (nvinfer1::ICudaEngine) is created from a TensorRT engine file. The engine creates an execution context that is used for inference.
This part of the TensorRT developer guide states which operations are thread-safe. The rest can be considered non-thread-safe.
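So for your multi-threaded case, a hedged sketch would be: share one engine per model between threads, and give every thread its own execution context and its own stream. In the sketch below I create the contexts and streams in the launching thread to stay on the safe side of those thread-safety rules; worker and runTwoThreads are placeholder names:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <thread>
#include <vector>

// Each worker gets its own IExecutionContext and its own cudaStream_t.
void worker(nvinfer1::IExecutionContext* context, cudaStream_t stream)
{
    // ... allocate per-thread device buffers, copy the input, then:
    // context->enqueueV2(bindings, stream, nullptr);
    // cudaStreamSynchronize(stream);
}

void runTwoThreads(nvinfer1::ICudaEngine* engine)
{
    std::vector<std::thread> threads;
    std::vector<cudaStream_t> streams(2);
    std::vector<nvinfer1::IExecutionContext*> contexts(2);
    for (int i = 0; i < 2; ++i)
    {
        contexts[i] = engine->createExecutionContext();   // one context per thread
        cudaStreamCreate(&streams[i]);                     // one stream per thread
        threads.emplace_back(worker, contexts[i], streams[i]);
    }
    for (auto& t : threads) { t.join(); }
    // Destroy the contexts and streams here (omitted for brevity).
}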