I have a large machine learning / computer vision project in Python that uses an ONNX model. Locally, the project takes around 3 seconds just to load the model and run inference.
Time taken to load onnx model: 0.2702977657318115
Time taken for onnx inference: 1.673530101776123
Time taken for onnx inference: 0.7677013874053955
After deploying the project, this loading time is incurred again on every individual request to the server.
For example, if 4 users send requests at once, all the results take around 30 seconds; with only 1 request, it takes around 10 seconds.
Problem: Is there any way to load the ONNX model only once when initializing the server, rather than on each and every POST request?
I tried asyncio.
It helped with queuing the requests, but the last request still has to wait around 30 seconds for its result, even though CPU usage is not at 100%. I am not sure whether the solution to my problem is loading the ONNX model only once, multithreading, or whether applying asyncio to my project is already the best approach.
Are you loading the model and then running inference every time you receive a request?
You should load the model once and keep the session alive for the lifetime of the inference process. You could also look into the execution providers that come with ONNX Runtime, such as CUDA or TensorRT, to try to speed up inference.
...
import onnxruntime as ort


def load_model(onnx_model_path):
    # Create the inference session once and reuse it for every call.
    inference_session = ort.InferenceSession(onnx_model_path)

    # Get input and output names of the model layers for the inference call.
    inputs = inference_session.get_inputs()[0].name
    outputs = [output.name for output in inference_session.get_outputs()]

    return inference_session, {"inputs": inputs, "outputs": outputs}


def main():
    inf_session, input_output_dict = load_model("your/model.onnx")

    while True:
        inf_session.run(
            input_output_dict["outputs"],
            {input_output_dict["inputs"]: your_array_input},
        )
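
If you are serving this behind a web framework, the same idea applies: create the InferenceSession once at application startup and reuse it inside every request handler. Here is a minimal sketch assuming a Flask app, a hypothetical /predict endpoint, and a JSON body carrying the input array under a "data" key; adapt the preprocessing to your actual model:

import numpy as np
import onnxruntime as ort
from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once at startup, not inside the request handler.
session = ort.InferenceSession("your/model.onnx")
input_name = session.get_inputs()[0].name
output_names = [output.name for output in session.get_outputs()]


@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical input format: a nested list under "data" in the JSON body.
    array = np.asarray(request.json["data"], dtype=np.float32)
    result = session.run(output_names, {input_name: array})
    return jsonify({"prediction": result[0].tolist()})

With this layout, each worker process pays the model-loading cost once, and concurrent requests only pay for inference.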
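
For the execution providers, you can pass a priority-ordered list when creating the session; ONNX Runtime uses the first available provider and falls back to the next one. This sketch assumes you have the GPU-enabled onnxruntime package and the matching CUDA/TensorRT libraries installed:

import onnxruntime as ort

# Providers are tried in order; CPU is kept as a fallback.
session = ort.InferenceSession(
    "your/model.onnx",
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())  # shows which providers were actually enabled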