Tags: pytorch, onnx, tritonserver

ONNX Runtime: io_binding.bind_input causing "no data transfer from DeviceType:1 to DeviceType:0"


I am using Nvidia Triton Inference Server with an ONNX model for inference on a GPU instance. The Dockerfile that defines the environment, inference server and models contains the following FROM/pip lines:

FROM --platform=linux/amd64 nvcr.io/nvidia/tritonserver:23.12-py3

RUN pip install torch transformers onnx onnxruntime-gpu onnxruntime

The model.py for the Triton Inference Server has been simplified to the following:

import onnxruntime as ort
import torch
import numpy as np

session = ort.InferenceSession("path/to/onnx.model", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

...

io_binding = session.io_binding()
pt_script_embeddings = torch.rand(
    size=(100, 2010), dtype=torch.float32, device="cuda:0"
).contiguous()

io_binding.bind_input(
    name="np_script_embeddings",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(pt_script_embeddings.shape),
    buffer_ptr=pt_script_embeddings.data_ptr(),
)

logit_output_shape = (100, 2)
logit_output = torch.empty(logit_output_shape, dtype=torch.float32, device='cuda:0').contiguous()
io_binding.bind_output(
    name="np_logits",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(logit_output.shape),
    buffer_ptr=logit_output.data_ptr()
)

session.run_with_iobinding(io_binding)
outputs = logit_output.cpu().numpy()

Unfortunately, the error below is triggered at the io_binding.bind_input line, causing me a lot of grief:

RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]
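
In ONNX Runtime's device enum, DeviceType:1 is the GPU and DeviceType:0 is the CPU, so the message says the runtime has no way to copy the GPU-bound buffer to CPU, which usually means the CUDAExecutionProvider never actually registered and the session silently fell back to CPU. A minimal check, reusing the session object from above, shows which providers were actually loaded:

import onnxruntime as ort

# Providers compiled into the installed onnxruntime build
print(ort.get_available_providers())

# Providers this session actually registered; if only
# ['CPUExecutionProvider'] shows up, the CUDA provider failed to load
print(session.get_providers())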



Solution

  • To resolve the issue I needed to carefully match the versions of CUDA, PyTorch and ONNX Runtime provided by the tritonserver Docker image with the Python packages of torch and onnxruntime-gpu installed manually. Here is the process in detail:

    • Check which CUDA version is currently supported by onnxruntime-gpu by visiting the ONNX CUDA execution provider requirements page. In my case it was cuda==12.2.
    • Navigate to the Triton IS release notes and look for the Container Version with the matching CUDA version from the prior step. In my case it was tritonserver:23.10-py3.
    • Navigate to the Triton IS version matrix to retrieve the version of PyTorch included with that Triton IS Docker image. In my case it was torch 2.1.

    Based on the collected versions, update the environment. In my case this meant the following changes to the Docker image (a quick runtime sanity check is sketched after the note below):

    FROM --platform=linux/amd64 nvcr.io/nvidia/tritonserver:23.10-py3
    
    RUN pip install transformers
    RUN pip install torch==2.1
    
    # https://onnxruntime.ai/docs/install/
    # https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
    RUN pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
    

    NOTE: if your build environment has no access to the Azure repo https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/, retrieve and install the files manually from https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12 (make sure to adjust cuda-12 to your CUDA version).
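
    A small sanity check, run inside the rebuilt container, helps confirm the versions actually line up before wiring everything back into Triton. This is only a sketch; the exact version strings printed will depend on your image and packages:

    import torch
    import onnxruntime as ort

    # CUDA version the installed torch wheel was built against
    # (should match the image's CUDA, e.g. 12.x for tritonserver:23.10-py3)
    print("torch:", torch.__version__, "CUDA:", torch.version.cuda)

    # onnxruntime-gpu version and the providers it was built with;
    # CUDAExecutionProvider must appear in this list
    print("onnxruntime:", ort.__version__)
    print("available providers:", ort.get_available_providers())

    # 'GPU' confirms the GPU build of onnxruntime is the one being imported
    print("ort device:", ort.get_device())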