python · inference · openvino · onnxruntime

OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference


I'm trying to speed up my model's inference by converting it to ONNX and running it with ONNX Runtime. However, I'm getting weird results when trying to measure the inference time.

When running only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms the OpenVINOExecutionProvider:

  • CPUExecutionProvider - 0.72 seconds
  • OpenVINOExecutionProvider - 4.47 seconds

But if I run, say, 5 iterations, the result is different:

  • CPUExecutionProvider - 3.83 seconds
  • OpenVINOExecutionProvider - 14.13 seconds

And if I run 100 iterations, the result is drastically different:

  • CPUExecutionProvider - 74.19 seconds
  • OpenVINOExecutionProvider - 46.96 seconds

It seems to me that the inference time of the OpenVINO EP does not scale linearly with the number of iterations, but I don't understand why. So my questions are:

  • Why does OpenVINOExecutionProvider behave this way?
  • What ExecutionProvider should I use?

The code is very basic:

import onnxruntime as rt
import numpy as np
import time 
from tqdm import tqdm

limit = 5  # number of timed inference runs
# MODEL
device = 'CPU_FP32'  # device_type option for the OpenVINO Execution Provider
model_file_path = 'road.onnx'

image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# OnnxRuntime (CPU Execution Provider; 'device_type' is an OpenVINO EP option, so no provider_options here)
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name

start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name

start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

Solution

  • The ONNX Runtime OpenVINO Execution Provider lets you run inference on ONNX models through the ONNX Runtime API while the OpenVINO toolkit does the work in the backend. This can accelerate an ONNX model's performance on Intel® CPUs, GPUs, VPUs and FPGAs compared to generic acceleration on the same hardware.

    Generally, the CPU Execution Provider works best for small iteration counts, since its design goal is to keep the binary size small. The OpenVINO Execution Provider, on the other hand, is intended for deep learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).

    The OpenVINO Execution Provider also pays a one-time cost to partition the graph and compile it for the OpenVINO backend, which dominates when you run only a few iterations; once that overhead is amortized over many runs, its per-iteration time is lower. This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider for larger iteration counts.

    You should choose the Execution Provider that suits your own requirements. If you are going to run a complex deep learning model for many iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary and only run a few iterations, choose the CPU Execution Provider instead. Either way, it is worth timing the first (warm-up) run separately from the steady-state runs, as in the sketch below.

    For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
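
    As a rough illustration, here is a minimal benchmarking sketch (not an official recipe) that times the first warm-up run separately from the averaged steady-state runs. It reuses the model path, input shape and 'CPU_FP32' device_type from the question; the file name and iteration count are placeholders, and the exact numbers will depend on your model and hardware.

import time

import numpy as np
import onnxruntime as rt

model_file_path = 'road.onnx'  # model from the question (placeholder)
image = np.random.rand(1, 3, 512, 512).astype(np.float32)
n_runs = 100  # steady-state iterations to average

def benchmark(providers, provider_options=None):
    sess = rt.InferenceSession(model_file_path,
                               providers=providers,
                               provider_options=provider_options)
    input_name = sess.get_inputs()[0].name

    # First run: includes any one-time setup/compilation the provider performs.
    start = time.perf_counter()
    sess.run(None, {input_name: image})
    warmup = time.perf_counter() - start

    # Steady state: average over repeated runs after the warm-up.
    start = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, {input_name: image})
    per_run = (time.perf_counter() - start) / n_runs
    return warmup, per_run

for name, providers, options in [
    ('CPUExecutionProvider', ['CPUExecutionProvider'], None),
    ('OpenVINOExecutionProvider', ['OpenVINOExecutionProvider'],
     [{'device_type': 'CPU_FP32'}]),
]:
    warmup, per_run = benchmark(providers, options)
    print(f'{name}: warm-up {warmup:.2f} s, '
          f'steady-state {per_run * 1000:.1f} ms/run')

    If the OpenVINO Execution Provider shows a large warm-up time but a lower steady-state time per run, that matches the pattern in your measurements.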