I am planning to implement real-time object detection on a smartphone. For iOS, I know I can use Core ML with Tiny YOLO to do this. However, the detection speed on the phone is slow and the accuracy is not good.
Therefore, I would like to run the object detection on a Python server instead. The smartphone would capture frames (or live-stream video) and send them to the server, the server would return the detection results, and the phone would draw the bounding boxes. Something like the minimal sketch below is what I have in mind. Is it possible to do all of this with low latency?
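For reference, this is roughly the kind of endpoint I am picturing (Flask is just one option, and `run_detector` is a placeholder for whatever detector I end up using, e.g. YOLOv3):

```python
# Minimal sketch of the proposed server. The phone POSTs one JPEG-encoded
# frame per request and gets bounding boxes back as JSON.
import io

from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def run_detector(image):
    # Placeholder: a real implementation would run a model such as YOLOv3.
    # Returns dummy data so the endpoint can be exercised end to end.
    return [{"x": 0, "y": 0, "w": image.width, "h": image.height,
             "label": "placeholder", "confidence": 0.0}]

@app.route("/detect", methods=["POST"])
def detect():
    # Decode the JPEG frame sent in the request body.
    frame = Image.open(io.BytesIO(request.get_data()))
    boxes = run_detector(frame)
    return jsonify({"boxes": boxes})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```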
Are you sure it's "slow and the accuracy is not good"?
Tiny YOLO runs at 200 FPS on the iPhone XS, and even the full YOLOv3 runs at real-time frame rates on recent devices. For an example, see https://github.com/Ma-Dan/YOLOv3-CoreML.
MobileNetV2+SSDLite gets up to 90 FPS.
You wouldn't actually run these models at such high frame rates, but those numbers show they're more than capable of running in real time. Even on older devices, SSDLite achieves real-time speeds.
(Granted, Faster R-CNN isn't really a speed monster on mobile.)
Let's say you need to achieve 30 FPS. That means you have 33 ms per frame to send the frame to the server, process it, and send the bounding boxes back to the client. Perhaps you can make it work fast enough for a single user, but what happens when thousands of users try to do this at once? How are you going to make the server fast enough then?
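To make that budget concrete, here is a back-of-the-envelope calculation. Every per-component number below is an illustrative assumption, not a measurement; on good Wi-Fi the network legs may be faster, on congested cellular they can be far worse:

```python
# Illustrative per-frame latency budget at 30 FPS.
BUDGET_MS = 1000 / 30  # ~33.3 ms per frame

# Assumed component costs (illustrative only, not measured).
components_ms = {
    "JPEG encode on phone":    5,
    "upload (~100 KB frame)": 20,
    "server-side inference":  15,
    "download results":        5,
    "decode + draw boxes":     2,
}

total = sum(components_ms.values())
print(f"budget: {BUDGET_MS:.1f} ms, spent: {total} ms")
for name, ms in components_ms.items():
    print(f"  {name}: {ms} ms")
print("over budget!" if total > BUDGET_MS else "within budget")
```

With these assumed numbers the round trip costs 47 ms, already well over the 33 ms budget before the server is under any load at all.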
You'll have to queue up the requests and handle them in batches to get the best throughput, but that also increases the latency.
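A minimal sketch of such a batching loop, using plain threads and queues (`model` is a placeholder for one batched forward pass of your detector):

```python
# Collect incoming frames for up to MAX_WAIT_S (or until BATCH_SIZE frames
# arrive), then run the model once on the whole batch. Throughput goes up,
# but every frame now waits for its batch, which adds latency.
import queue
import threading
import time

BATCH_SIZE = 8
MAX_WAIT_S = 0.010  # 10 ms cap on the extra latency added by batching

requests = queue.Queue()  # items are (frame, reply_queue) pairs

def model(frames):
    # Placeholder for batched inference, e.g. one forward pass of YOLOv3.
    return [[] for _ in frames]  # one (empty) list of boxes per frame

def batch_worker():
    while True:
        batch = [requests.get()]  # block until at least one frame arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(requests.get(timeout=timeout))
            except queue.Empty:
                break
        results = model([frame for frame, _ in batch])
        for (_, reply), boxes in zip(batch, results):
            reply.put(boxes)  # hand the result back to the waiting request

threading.Thread(target=batch_worker, daemon=True).start()

def handle_request(frame):
    """Called per client request; blocks until the batch containing
    this frame has been processed."""
    reply = queue.Queue(maxsize=1)
    requests.put((frame, reply))
    return reply.get()
```

Note the knob you're turning: a larger `BATCH_SIZE` or `MAX_WAIT_S` improves GPU utilization but pushes every user's frame closer to (or past) that 33 ms budget.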
By doing this on a server, you actually have three problems to solve: 1) making it fast enough, 2) making it scale, and 3) paying for it.
I'm not saying it can't be done using a server, but don't write off on-device object detection too soon.