I have example code that creates a session for an ONNX model.
import onnxruntime as ort

# Configure threading and create the inference session on the GPU
so = ort.SessionOptions()
so.inter_op_num_threads = 10
so.intra_op_num_threads = 10
session = ort.InferenceSession('example.onnx',
                               sess_options=so,
                               providers=['CUDAExecutionProvider'])
When I use inputs of the same size, e.g. 200, it works fine and runs very fast.
import numpy as np
from tqdm import tqdm

# Fixed-size input: every iteration uses shape (1, 200, 80)
for i in tqdm(range(1000)):
    array = np.zeros((1, 200, 80), dtype=np.float32)
    embeddings = session.run(output_names=['embs'], input_feed={'feats': array})
But when I use dynamically sized inputs, it runs very slowly for the first few hundred or even thousand iterations, and then somehow optimizes itself and runs as fast as the first example.
import random

# Dynamic input: the sequence length varies between 200 and 1000
for i in tqdm(range(1000)):
    array = np.zeros((1, random.randint(200, 1000), 80), dtype=np.float32)
    embeddings = session.run(output_names=['embs'], input_feed={'feats': array})
Is there any way to speed up the second example?
I tried batching, but because the input sizes can vary so much, it makes the output a little less accurate.
This is because ONNX models loaded with onnxruntime are not really dynamic, only their inputs are. When the computational graph is loaded, i.e. when you create an InferenceSession, onnxruntime allocates memory for all tensors needed to execute the model. If the model was exported with dynamic inputs, onnxruntime does not yet know how much memory to reserve for the input, intermediate, and output tensors. So it initially just loads the graph and then performs the tensor allocations during the first run call.
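As a quick illustration (a timing sketch reusing the session and tensor names from the question), the very first run call pays this allocation cost, while a second call with the same shape does not:

import time

x = np.zeros((1, 200, 80), dtype=np.float32)

t0 = time.perf_counter()
session.run(output_names=['embs'], input_feed={'feats': x})
print('first call:', time.perf_counter() - t0)   # includes tensor allocation

t0 = time.perf_counter()
session.run(output_names=['embs'], input_feed={'feats': x})
print('second call:', time.perf_counter() - t0)  # buffers already allocated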
However, if you run the model again with inputs bigger than in the previous call, it has to repeat that whole step to accommodate the larger memory requirement. So what you're seeing in the last block is that every time random.randint hits a number larger than the previous maximum, onnxruntime has to reallocate a lot of GPU memory. This goes on until randint outputs one thousand; larger inputs are then no longer possible, and all subsequent calls run smoothly.
Is there any way to speed up the second example?
Call run at the beginning of your inference script with the biggest input shape present in your dataset.
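For example, a minimal warm-up sketch (assuming the same 'feats'/'embs' names and a maximum sequence length of 1000, as in the question):

# Warm-up: run once with the largest expected input so onnxruntime
# allocates GPU memory for the worst case up front.
max_len = 1000  # assumed largest sequence length in the dataset
warmup = np.zeros((1, max_len, 80), dtype=np.float32)
session.run(output_names=['embs'], input_feed={'feats': warmup})

# Subsequent calls with smaller inputs reuse the already-allocated buffers
# and should run at full speed from the first iteration.

This moves the one-time reallocation cost out of the main loop, so the dynamically sized inputs no longer trigger it mid-run.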