I'm building an application using the webcam to control video games (kinda like a kinect). It uses the webcam (cv2.VideoCapture(0)), AI pose estimation (mediapipe), and custom logic to pipe inputs into dolphin emulator.
The issue is the latency. I've used my phone's high-speed camera to record myself snapping and found a latency of around 32 frames (~133 ms) between my hand and the frame onscreen. This is before any additional code, just a loop with a video read and cv2.imshow (about 15 ms).
Is there any way to decrease this latency?
I'm already grabbing the frame in a separate Thread, setting CAP_PROP_BUFFERSIZE to 0, and lowering the CAP_PROP_FRAME_HEIGHT and CAP_PROP_FRAME_WIDTH, but I still get ~133ms of latency. Is there anything else I can be doing?
Here's my code below:
import cv2
from threading import Thread, Condition

class WebcamStream:
    def __init__(self, src=0):
        self.stopped = False
        self.stream = cv2.VideoCapture(src)
        # keep the driver-side frame buffer as small as possible
        self.stream.set(cv2.CAP_PROP_BUFFERSIZE, 0)
        self.stream.set(cv2.CAP_PROP_FRAME_HEIGHT, 400)
        self.stream.set(cv2.CAP_PROP_FRAME_WIDTH, 600)
        (self.grabbed, self.frame) = self.stream.read()
        self.hasNew = self.grabbed
        self.condition = Condition()

    def start(self):
        # grab frames on a background thread so read() never waits on the camera
        Thread(target=self.update, args=(), daemon=True).start()
        return self

    def update(self):
        while True:
            if self.stopped:
                return
            (self.grabbed, self.frame) = self.stream.read()
            with self.condition:
                self.hasNew = True
                self.condition.notify_all()

    def read(self):
        # block until the grabber thread signals that a fresh frame has arrived
        with self.condition:
            if not self.hasNew:
                self.condition.wait()
            self.hasNew = False
        return self.frame

    def stop(self):
        self.stopped = True
The application needs to run as close to real time as possible, so any reduction in latency, no matter how small, would be great. Currently, between the webcam latency (~133 ms), pose estimation and logic (~25 ms), and the actual time it takes to move into the correct pose, it racks up to about 350-400 ms of latency. Definitely not ideal when I'm trying to play a game.
EDIT: Here's the code I used to test the latency (running it on my laptop, recording my hand and the screen, and counting the frame difference when snapping):
if __name__ == "__main__":
    cap = WebcamStream().start()
    while True:
        frame = cap.read()
        cv2.imshow('frame', frame)
        cv2.waitKey(1)
The experience you describe above is a vivid example of how accumulated latencies can devastate any chance of keeping a control-loop tight enough to indeed control something meaningfully stable, as in the MAN-to-MACHINE-INTERFACE system we wish to keep:
User's-motion | CAM-capture | IMG-processing | GUI-show | User's-visual-cortex-scene-capture | User's decision+action | loop
A real-world profiling of such an OpenCV pipeline lets us "sense" how much time is spent in each of the actual acquisition-storage-transformation-postprocessing-GUI phases.
Forgive, for a moment, a raw sketch of where each of the particular latency-related costs accumulates:
    CAM --> USB --> H/W --> O/S driver --> openCV I/O ( as FAST as possible )
                                                |
                                                v
                 in-RAM openCV::MAT storage .......... openCV {signed|unsigned}-{size}-{N-channels}
                 openCV::MAT object-mapper
                 openCV::MAT NATIVE-object PROCESSING
                                                |
    python code calling a cv2.<function>() -----+---- 2x per-call COST : MOVES DATA
                                                |      1. THERE ( openCV::MAT --> python numpy.array )
                                                |      2. BACK  ( python numpy.array --> openCV::MAT )
                                                |      + forMAT TRANSFORMER TRANSFORMATIONS
    python code GIL-awaiting ................. ~ 100 [ms] chopping
Hardware latencies could be shaved off, yet changing already acquired hardware can turn expensive.
Software latencies of already latency-optimised toolboxes can be shaved off too, yet it gets harder and harder.
Design inefficiencies are the final and most common place where latencies can be shaved off.
OpenCV ?
There is not much to do here. The problem is with the OpenCV-Python binding details:
... So when you call a function, say res = equalizeHist(img1, img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C++. Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C++ which gives us almost same speed as that of C++.
This works fine "outside" a control-loop, but not in our case, where both of the two transport costs, the transformation costs and any new or interim-data storage RAM-allocation costs worsen our control-loop TAT.
So avoid any and all calls of OpenCV-native functions from the Python side ( behind the bindings' latency extra-miles ), no matter how tempting or sweet these may look at first sight. HUNDREDS of [ms] are a rather bitter cost of ignoring this advice.
Python ?
Yes, Python. Using the Python interpreter introduces latency per se, plus adds problems with concurrency-avoided processing, no matter how many cores our hardware operates on ( while recent Py3 releases try hard to lower these costs inside the interpreter-level software ).
We can test & squeeze the max out of the ( still unavoidable, in 2022 ) GIL-lock interleaving: check sys.getswitchinterval() and test increasing this amount for less interleaved python-side processing ( the tweaking depends on your other python-application ambitions: GUI, distributed-computing tasks, python network-I/O workloads, python HW-I/O-s, if applicable, etc. ).
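A minimal sketch of that probe, assuming the frame-grabbing loop is the only latency-critical thread in the process ( the 0.05 [s] value is an arbitrary starting point to experiment with, not a recommendation ):

import sys

# inspect the current GIL thread-switch interval ( CPython default ~ 0.005 [s] )
print( sys.getswitchinterval() )

# experimentally raise it, so the frame-grabbing thread gets longer uninterrupted
# slices of the interpreter; re-measure the end-to-end latency after each change
sys.setswitchinterval( 0.05 )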
RAM-memory-I/O costs ?
Our next major enemy. Using the least-sufficient image-DATA-format that MediaPipe can still work with is the way forward in this segment.
Avoidable losses
All other ( our ) sins belong to this segment. Avoid any image-DATA-format transformations ( see above, the cost may easily grow into hundreds of thousands of [us] just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap ).
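As a rough illustration of how such a "small" conversion adds up inside a control-loop, here is a hedged micro-benchmark sketch ( the 400x600 frame and the BGR-to-RGB conversion are stand-ins for whatever transformation your pipeline happens to perform ):

import time
import cv2
import numpy as np

frame = np.zeros( ( 400, 600, 3 ), dtype = np.uint8 )    # stand-in for an acquired BGR frame

t0 = time.perf_counter_ns()
rgb = cv2.cvtColor( frame, cv2.COLOR_BGR2RGB )            # "just another colourmap"
t1 = time.perf_counter_ns()

print( ( t1 - t0 ) / 1e3, "[us] spent on a single, avoidable conversion" )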
MediaPipe lists the enumerated formats it can work with:
// ImageFormat
SRGB: sRGB, interleaved: one byte for R,
then one byte for G,
then one byte for B for each pixel.
SRGBA: sRGBA, interleaved: one byte for R,
one byte for G,
one byte for B,
one byte for alpha or unused.
SBGRA: sBGRA, interleaved: one byte for B,
one byte for G,
one byte for R,
one byte for alpha or unused.
GRAY8: Grayscale, one byte per pixel.
GRAY16: Grayscale, one uint16 per pixel.
SRGB48: sRGB, interleaved, each component is a uint16.
SRGBA64: sRGBA, interleaved, each component is a uint16.
VEC32F1: One float per pixel.
VEC32F2: Two floats per pixel.
So, choose the MVF -- the minimum viable format -- for the gesture-recognition to work and downscale the amount of pixels as much as possible ( a 400x600-GRAY8 would be my hot candidate ).
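If your MediaPipe version exposes the Tasks-API mp.Image wrapper and the concrete task accepts grayscale input at all ( both are assumptions to verify against the solution you actually use ), wrapping the raw frame would look roughly like this:

import mediapipe as mp
import numpy as np

gray = np.empty( ( 400, 600 ), dtype = np.uint8 )   # raw GRAY8 frame, as acquired

# wrap the buffer for MediaPipe without any colourmap conversion
# ( some versions may expect an explicit ( H, W, 1 ) channel axis )
mp_image = mp.Image( image_format = mp.ImageFormat.GRAY8, data = gray )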
Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than just plain storing this MVF in a RAW format on the native side of the Acquisition-&-Pre-processing chain, so that no other post-process formatting takes place.
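A minimal configuration sketch along these lines ( the 'YUYV' FOURCC and the property values are assumptions; which pixel formats the driver actually honours is camera-dependent, so verify each .set() with a matching .get() ):

import cv2

cap = cv2.VideoCapture( 0 )

cap.set( cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc( *'YUYV' ) )   # ask for a raw pixel format
cap.set( cv2.CAP_PROP_CONVERT_RGB, 0 )                              # skip the implicit BGR conversion
cap.set( cv2.CAP_PROP_FRAME_WIDTH,  600 )
cap.set( cv2.CAP_PROP_FRAME_HEIGHT, 400 )

print( cap.get( cv2.CAP_PROP_FOURCC ), cap.get( cv2.CAP_PROP_CONVERT_RGB ) )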
If indeed forced to ever touch the python-side numpy.array object, prefer to use vectorised & striding-tricks powered operations over .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs that would further increase the control-loop TAT.
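A tiny sketch of the zero-copy flavour of such an operation ( the halving-by-slicing step is just an illustrative stand-in for whatever strided access your logic needs ):

import numpy as np

frame = np.empty( ( 400, 600 ), dtype = np.uint8 )   # hot, already-stored GRAY8 frame

half = frame[ ::2, ::2 ]                              # a strided view: no copy, no new RAM-allocation
print( half.shape, half.base is frame )               # ( 200, 300 ) True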
Eliminate any python-side OpenCV calls ( as these cost you ~2x the data-I/O + transformation costs ) by precisely configuring the native-side OpenCV processing to match the needed MediaPipe data-format.
Minimise, better yet avoid, any blocking; if the control-loop is still too skewed, try distributed-processing, moving the raw data into another process ( not necessarily a Python interpreter ) on localhost or within a sub-ms LAN domain ( further tips available here ).
Try to fit the hot-DATA RAM-footprints to match your CPU cache-hierarchy's cache-line sizing & associativity details ( see this ).