pytorch, onnx, onnxruntime

Pre-allocating dynamic shaped tensor memory for ONNX runtime inference?


I am currently trying out onnxruntime-gpu and want to pre-process my images on the GPU using NVIDIA DALI. Everything works and I am able to pre-process the images, but the problem is that I want to keep all of the data on the device so as not to create a bottleneck by copying data back and forth between device and host.

The onnxruntime library provides IO bindings to bind inputs and outputs to a device. The problem is that this is quite static, which causes issues when pre-allocating memory for output tensors of varying shapes. For example, I am using a RetinaNet, which produces differently sized predictions that I cannot seem to handle.
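To make the dynamic-shape problem concrete: inspecting the session's output metadata shows symbolic dimensions instead of fixed sizes, so there is nothing to pre-allocate against. A rough sketch (the model path is a placeholder, and the symbolic names depend on how the model was exported):

```python
import onnxruntime as ort

# Placeholder model path; the symbolic dimension names depend on the export.
session = ort.InferenceSession("retinanet.onnx", providers=["CUDAExecutionProvider"])

for out in session.get_outputs():
    # Dynamic dimensions appear as strings (symbolic names) rather than integers,
    # so there is no fixed size to pre-allocate an output buffer for.
    print(out.name, out.shape)
```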

For pre-processing, I use the following code:

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline


class ImagePipeline(Pipeline):
    def __init__(self, file_list, batch_size, num_threads, device_id):
        super(ImagePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.readers.File(file_root="", file_list=file_list)
        self.decode = ops.decoders.Image(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=800, resize_y=800)

        self.normalize = ops.CropMirrorNormalize(
            device="gpu",
            dtype=types.FLOAT,
            output_layout=types.NCHW,
            crop=(800, 800),
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )

    def define_graph(self):
        inputs, labels = self.input()
        images = self.decode(inputs)
        images = self.resize(images)
        images = self.normalize(images)
        return images, labels
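For reference, the pipeline is built and run roughly like this (file list, batch size, and thread count below are placeholders):

```python
# Rough usage sketch; file list, batch size and thread count are placeholders.
pipe = ImagePipeline(file_list="images.txt", batch_size=8, num_threads=2, device_id=0)
pipe.build()

images, labels = pipe.run()  # `images` is a TensorListGPU that stays on the GPU
```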

This correctly creates batches of images of shape (BATCH_SIZE, 800, 800). For running inference with these batches, I use the following snippet:

def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros(CURR_SIZE, 4, dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())
    
    binding.bind_output(
        name=session.get_outputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int64,
        shape=tuple(x_tensor.shape),
        buffer_ptr=z_tensor.data_ptr())

    session.run_with_iobinding(binding)

    return z_tensor.squeeze(0)

This is where the problem occurs: I cannot create correctly shaped z_tensors. I am using the pre-trained RetinaNet from https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn_v2.html#torchvision.models.detection.retinanet_resnet50_fpn_v2.
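For anyone reproducing this, a typical way to export that model to ONNX looks something like the following (a sketch only; the opset version and dummy input size are assumptions, not necessarily what I used):

```python
import torch
import torchvision

# Sketch of a typical export; opset version and dummy input size are assumptions.
model = torchvision.models.detection.retinanet_resnet50_fpn_v2(weights="DEFAULT")
model.eval()

dummy = [torch.rand(3, 800, 800)]
torch.onnx.export(model, dummy, "retinanet.onnx", opset_version=11)
```

The exported detection outputs (boxes, scores, labels) have a leading dimension equal to the number of detections, which varies per image, so the output shape genuinely cannot be known before the model has run.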

I have found a work-around which is the following:

def run_with_data_on_device(x):
    x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x)
    io_binding = session.io_binding()
    io_binding.bind_input(name=session.get_inputs()[0].name, device_type=x_ortvalue.device_name(), device_id=0, element_type=x.dtype, shape=x_ortvalue.shape(), buffer_ptr=x_ortvalue.data_ptr())
    io_binding.bind_output(name=session.get_outputs()[-1].name, device_type=DEVICE_NAME, device_id=DEVICE_INDEX, element_type=x.dtype, shape=x_ortvalue.shape())
    session.run_with_iobinding(io_binding)

    z = io_binding.get_outputs()

    return z[0]

But this naturally causes an unnecessary round trip to the host... Am I overlooking something obvious? Why can I not initialize z_tensor as (None, None) and get a dynamically shaped output tensor?

UPDATED CODE:

def run_with_torch_tensors_on_device(x: torch.Tensor, CURR_SIZE: int, torch_type: torch.dtype = torch.float) -> torch.Tensor:
    binding = session.io_binding()
    x_tensor = x.contiguous()
    z_tensor = torch.zeros((CURR_SIZE, 91), dtype=torch_type, device=DEVICE).contiguous()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        buffer_ptr=x_tensor.data_ptr(),
        shape=x_tensor.shape)

    binding.bind_output(session.get_outputs()[-1].name, "cuda")

    session.run_with_iobinding(binding)

    ort_output = binding.get_outputs()
    return ort_output[0]

However, this returns `<onnxruntime.capi.onnxruntime_inference_collection.OrtValue object at 0x7f237bf1ebc0>` rather than a tensor I can use directly.

Solution

  • The IO Binding API is documented here: https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding

    You can actually bind an output by its name only, since the other parameters are optional. If you do, the memory will be allocated by onnxruntime, which helps with the case of a dynamic output shape (see the sketch below this list).

    get_outputs() returns OrtValues that live on the device, and copy_outputs_to_cpu() copies the data to the CPU.

    There are also many examples on that page. See the first example in the "Data on device" section: https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device
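A minimal sketch of that pattern, reusing the `session`, `DEVICE_NAME`, and `DEVICE_INDEX` globals from the question (the choice of output index is an assumption; bind whichever outputs you need):

```python
import numpy as np

def run_with_dynamic_output(x_tensor):
    binding = session.io_binding()

    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(x_tensor.shape),
        buffer_ptr=x_tensor.data_ptr())

    # Bind the output by name only: onnxruntime allocates device memory of
    # whatever shape the model actually produces at run time.
    binding.bind_output(
        session.get_outputs()[0].name,
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX)

    session.run_with_iobinding(binding)

    ort_value = binding.get_outputs()[0]  # OrtValue, still on the device
    # boxes = binding.copy_outputs_to_cpu()[0]  # only if a host copy is wanted
    return ort_value
```

The returned OrtValue exposes shape() and data_ptr(), so it can stay on the GPU for further processing.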