
How to optimize ONNX inference for dynamic input


I have example code for creating a session for the ONNX model.

    import onnxruntime as ort

    so = ort.SessionOptions()
    so.inter_op_num_threads = 10
    so.intra_op_num_threads = 10
    session = ort.InferenceSession('example.onnx',
                                   sess_options=so,
                                   providers=['CUDAExecutionProvider'])


So, when I use inputs of the same size, like 200, it's okay and works very fast.

    import numpy as np
    from tqdm import tqdm

    for i in tqdm(range(1000)):
        array = np.zeros((1, 200, 80), dtype=np.float32)
        embeddings = session.run(output_names=['embs'], input_feed={'feats': array})

But when I use dynamic input sizes, it runs very slowly for the first few hundred or even thousand iterations, and then somehow optimizes itself and runs as fast as the first example.

    import random

    for i in tqdm(range(1000)):
        array = np.zeros((1, random.randint(200, 1000), 80), dtype=np.float32)
        embeddings = session.run(output_names=['embs'], input_feed={'feats': array})

Is there any way to speed up the second example?

I tried batching, but because the input sizes can differ so much, it makes the output a little less accurate.


Solution

  • This is because ONNX models loaded with onnxruntime are not really dynamic, only their inputs are.

    When the computational graph is loaded, i.e. when you create an InferenceSession, onnxruntime allocates memory for all tensors needed to execute the model. If the model was exported with dynamic inputs, onnxruntime does not yet know how much memory to reserve for the input, intermediate, and output tensors. So it initially just loads the graph, and then does the tensor allocations during the first run call.

    However, if you run the model again with inputs bigger than in the previous call, it has to repeat that allocation step to accommodate the larger memory requirement. So what you're seeing in the last block is that every time random.randint hits a number larger than the previous maximum, onnxruntime has to reallocate a lot of GPU memory. This goes on until randint returns 1000; after that, larger inputs are no longer possible, so all subsequent calls run smoothly.
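
    A minimal sketch to observe this, reusing the session object from the question (the timing print is just for illustration, not part of the fix):

        import random
        import time

        import numpy as np

        max_seen = 0
        for i in range(100):
            length = random.randint(200, 1000)
            array = np.zeros((1, length, 80), dtype=np.float32)
            start = time.perf_counter()
            session.run(output_names=['embs'], input_feed={'feats': array})
            elapsed = (time.perf_counter() - start) * 1000
            if length > max_seen:
                # A new maximum length forces onnxruntime to grow its
                # allocations, so these calls should be noticeably slower.
                print(f'new max {length}: {elapsed:.1f} ms')
                max_seen = length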

    Is there any way to speed up the second example?

    Call run at the beginning of your inference script with the biggest input shape present in your dataset.
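
    A minimal sketch of that warm-up, assuming the session object from the question; MAX_LEN is a hypothetical constant you would set to the longest input in your dataset:

        import numpy as np

        # One dummy run at the largest shape the model will ever see, so
        # onnxruntime allocates all tensor memory up front.
        MAX_LEN = 1000  # set to the longest input in your dataset
        warmup = np.zeros((1, MAX_LEN, 80), dtype=np.float32)
        session.run(output_names=['embs'], input_feed={'feats': warmup})

        # Every later call with a length <= MAX_LEN reuses the existing
        # allocations and runs at full speed from the first iteration.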