python | multithreading | openvino

How do I do async inference on OpenVINO?


I wrote a Python server that uses an OpenVINO network to run inference on incoming requests. To speed things up, I receive requests in multiple threads, and I would like to run the inferences concurrently. But whatever I do, the times I measure are the same as for the non-concurrent solution, which makes me think I've missed something.

I'm writing it in Python, using openvino 2019.1.144, and I'm issuing multiple requests to the same plugin and network to try to make the inferences run concurrently.

import os
from concurrent.futures import ThreadPoolExecutor
from openvino.inference_engine import IENetwork, IEPlugin

class Detector:
    def __init__(self, num_of_requests: int = 4):
        self._plugin = IEPlugin("CPU", plugin_dirs=None)
        model_path = './Det/'
        model_xml = os.path.join(model_path, "ssh_graph.xml")
        model_bin = os.path.join(model_path, "ssh_graph.bin")
        net = IENetwork(model=model_xml, weights=model_bin)
        self._input_blob = next(iter(net.inputs))

        # Load the network to the plugin with one infer request per worker
        self._exec_net = self._plugin.load(network=net, num_requests=num_of_requests)
        del net

def _async_runner(det, images_subset, idx):
    # Each worker owns one request_id, so the requests never collide
    for img in images_subset:
        request_handle = det._exec_net.start_async(request_id=idx, inputs={det._input_blob: img})
        request_handle.wait()  # blocks until this inference finishes


def run_async(images):  # These are the images to infer
    det = Detector(num_of_requests=4)
    multiplier = len(images) // 4
    with ThreadPoolExecutor(4) as pool:
        futures = []
        for idx in range(4):
            # Give each of the four workers its own quarter of the images
            images_subset = images[idx*multiplier:(idx+1)*multiplier]
            futures.append(pool.submit(_async_runner, det, images_subset, idx))
        # Exiting the with-block waits for all workers to finish

When I run 800 inferences in sync mode, I get an average run time of 290 ms. When I run in async mode, I get an average run time of 280 ms. That is not a substantial improvement. What am I doing wrong?


Solution

  • If you use wait(), the execution thread blocks until the result is available. If you want a truly async mode, you need wait(0), which does not block execution. Just launch the inference whenever you need to and store the request_id; you can then check whether the results are available by testing whether wait(0) returns 0. Be careful not to reuse a request_id while the Inference Engine is still running that inference: it will cause a collision and raise an exception.

    However, in the code you provided you cannot do this, because you are creating a thread pool in which each thread runs inference on its own image subset under a unique request_id. This is in fact parallel execution, which will give you reasonably good performance, but it isn't "async" mode.

    A truly async mode would be something like this:

    while still_items_to_infer():
        get_item_to_infer()
        get_unused_request_id()
        launch_infer()
        do_something()
        if results_available():
            get_inference_results()
            free_request_id()
            #This may be in a new thread
            process_inference_results()
    

    This way, you keep dispatching new inferences while earlier ones are still in flight, instead of blocking on each one.
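
    Below is a minimal sketch of that polling loop, using the same 2019.1 Python API as your code; load_images() and process_results() are placeholders for your own input and post-processing logic, not part of the API:

    from collections import deque
    from openvino.inference_engine import IENetwork, IEPlugin

    NUM_REQUESTS = 4

    plugin = IEPlugin("CPU", plugin_dirs=None)
    net = IENetwork(model="./Det/ssh_graph.xml", weights="./Det/ssh_graph.bin")
    input_blob = next(iter(net.inputs))
    out_blob = next(iter(net.outputs))
    exec_net = plugin.load(network=net, num_requests=NUM_REQUESTS)

    free_ids = deque(range(NUM_REQUESTS))  # request slots currently idle
    busy_ids = deque()                     # request slots with an inference in flight
    images = list(load_images())           # placeholder for your input images

    while images or busy_ids:
        # Dispatch new inferences while there are free slots and work left
        while images and free_ids:
            idx = free_ids.popleft()
            exec_net.start_async(request_id=idx, inputs={input_blob: images.pop()})
            busy_ids.append(idx)

        # Poll the oldest in-flight request; wait(0) returns its status immediately
        idx = busy_ids[0]
        if exec_net.requests[idx].wait(0) == 0:  # 0 means the result is ready
            busy_ids.popleft()
            results = exec_net.requests[idx].outputs[out_blob]
            free_ids.append(idx)
            process_results(results)  # placeholder; could be handed off to another thread

    If no result is ready yet, the loop simply polls again; in a real server you would do useful work (or sleep briefly) between polls, which is what do_something() stands for in the pseudocode above.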