I developed a Flask-based API that uses YOLOv5, served with Nginx and Gunicorn. Everything works fine for a single request, but no matter whether I give it a 10-core or a 50-core CPU, only one request is answered at a time.
The weights are loaded once, outside the request handler, and only the already-loaded model is used when a request arrives:
import torch

# YOLOv5 helpers (from the yolov5 repository)
from models.experimental import attempt_load
from utils.general import check_img_size, set_logging
from utils.torch_utils import select_device, load_classifier

weights = "./x.pt"
imgsz = 640
device = ""

set_logging()
device = select_device(device)

# Load model once at startup
model = attempt_load(weights, map_location=device)  # load FP32 model
stride = int(model.stride.max())  # model stride
imgsz = check_img_size(imgsz, s=stride)  # check img_size
# Second-stage classifier (disabled)
classify = False
if classify:
    modelc = load_classifier(name='resnet101', n=2)  # initialize
    # load_state_dict() returns a result object, not the model,
    # so move/eval the model itself rather than chaining:
    modelc.load_state_dict(torch.load('resnet101.pt', map_location=device)['model'])
    modelc.to(device).eval()
@application.route("URL", methods=['POST'])
def XXX():
    ...
I'd be very grateful for any suggestions. Thanks, and sorry about my English.
The problem is solved. I had always set the number of workers (in the Gunicorn settings) equal to the number of CPU cores, and that was the problem: when I set the number of workers to 1, the issue went away.
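With a single worker there is only one in-process copy of the model, and it can still serve concurrent requests if Gunicorn runs that worker with multiple threads; access to the shared, non-thread-safe model then has to be serialized. A minimal stdlib-only sketch of that pattern (the `fake_model` function is a hypothetical stand-in for the YOLOv5 inference call):

```python
import threading
import time

# One shared, non-thread-safe "model" per process, loaded once at startup.
_inference_lock = threading.Lock()
results = []

def fake_model(image_id):
    # Placeholder for model(img); real inference occupies the device for a while.
    time.sleep(0.01)
    return f"detections for {image_id}"

def handle_request(image_id):
    # Serialize every request's access to the single shared model.
    with _inference_lock:
        results.append(fake_model(image_id))

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # prints 4: all requests served, one inference at a time
```

The lock trades parallel inference for correctness, which matches the single-model setup above; true parallelism would require one model copy per worker process.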
Service file location (CentOS):
/etc/systemd/system/X.service
Changed to:
WorkingDirectory=X
Environment="PATH=X"
ExecStart=X/venv/bin/gunicorn --workers 1 --timeout 200 --bind unix:X.sock -m 007 run
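If concurrent requests are still needed with a single worker, Gunicorn can also run that worker with several threads (the `--threads` option switches it to the threaded worker type). A sketch of the same `ExecStart` line with that change, keeping the paths elided as in the unit file above; the thread count of 4 is an assumption to tune for your hardware:

ExecStart=X/venv/bin/gunicorn --workers 1 --threads 4 --timeout 200 --bind unix:X.sock -m 007 run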