I'm building an application using the webcam to control video games (kinda like a kinect). It uses the webcam (cv2.VideoCapture(0)), AI pose estimation (mediapipe), and custom logic to pipe inputs into dolphin emulator.
The issue is the latency. I've used my phone's high-speed camera to record myself snapping and found a latency of around 32 frames (~133 ms) between my hand and the frame onscreen. This is before any additional code, just a loop with a video read and cv2.imshow (about 15 ms).
Is there any way to decrease this latency?
I'm already grabbing the frame in a separate Thread, setting CAP_PROP_BUFFERSIZE to 0, and lowering the CAP_PROP_FRAME_HEIGHT and CAP_PROP_FRAME_WIDTH, but I still get ~133ms of latency. Is there anything else I can be doing?
Here's my code below:
import cv2
from threading import Thread, Condition

class WebcamStream:
    def __init__(self, src=0):
        self.stopped = False
        self.stream = cv2.VideoCapture(src)
        # keep the driver-side frame buffer as small as possible
        self.stream.set(cv2.CAP_PROP_BUFFERSIZE, 0)
        self.stream.set(cv2.CAP_PROP_FRAME_HEIGHT, 400)
        self.stream.set(cv2.CAP_PROP_FRAME_WIDTH, 600)
        (self.grabbed, self.frame) = self.stream.read()
        self.hasNew = self.grabbed
        self.condition = Condition()

    def start(self):
        # grab frames on a background thread so read() never waits on the camera
        Thread(target=self.update, args=(), daemon=True).start()
        return self

    def update(self):
        while True:
            if self.stopped:
                return
            (self.grabbed, self.frame) = self.stream.read()
            with self.condition:
                self.hasNew = True
                self.condition.notify_all()

    def read(self):
        # block until the grabber thread signals that a fresh frame has arrived
        with self.condition:
            if not self.hasNew:
                self.condition.wait()
            self.hasNew = False
        return self.frame

    def stop(self):
        self.stopped = True
The application needs to run as close to real time as possible, so any reduction in latency, no matter how small, would be great. Currently, between the webcam latency (~133 ms), pose estimation and logic (~25 ms), and the actual time it takes to move into the correct pose, it racks up to about 350-400 ms of latency. Definitely not ideal when I'm trying to play a game.
EDIT: Here's the code I used to test the latency (running it on my laptop, recording my hand and the screen, and counting the frame difference when snapping):
if __name__ == "__main__":
    cap = WebcamStream().start()
    while True:
        frame = cap.read()
        cv2.imshow('frame', frame)
        cv2.waitKey(1)
The experience you describe above is a vivid example of how accumulated latencies can devastate any chance of keeping a control-loop tight enough to indeed control something meaningfully stable, as in the MAN-to-MACHINE-INTERFACE system we wish to keep:
User's-motion | CAM-capture | IMG-processing | GUI-show | User's-visual-cortex-scene-capture | User's decision+action | loop
A real-world profiling of such an OpenCV pipeline lets us "sense" how much time is spent in each of the actual acquisition-storage-transformation-postprocessing-GUI phases.
Forgive, for a moment, a raw sketch of where each of the particular latency-related costs accumulates:
    CAM --> USB --> H/W --> O/S driver --> openCV I/O ( as FAST as possible )
                                                |
                                                v
                 in-RAM openCV::MAT storage .......... openCV {signed|unsigned}-{size}-{N-channels}
                 openCV::MAT object-mapper
                 openCV::MAT NATIVE-object PROCESSING
                                                |
    python code calling a cv2.<function>() -----+---- 2x per-call COST : MOVES DATA
                                                |      1. THERE ( openCV::MAT --> python numpy.array )
                                                |      2. BACK  ( python numpy.array --> openCV::MAT )
                                                |      + forMAT TRANSFORMER TRANSFORMATIONS
    python code GIL-awaiting ................. ~ 100 [ms] chopping
Hardware latencies could be shaved off, yet changing already acquired hardware can turn expensive.
Software latencies of already latency-optimised toolboxes can be shaved off too, yet it gets harder and harder.
Design inefficiencies are the final and most common place where latencies can be shaved off.
OpenCV ?
There is not much to do here. The problem is with the OpenCV-Python binding details:
... So when you call a function, say res = equalizeHist(img1, img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C++. Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C++ which gives us almost same speed as that of C++.
This works fine "outside" a control-loop, but not in our case, where both of the two transport costs, the transformation costs and any new or interim-data storage RAM-allocation costs worsen our control-loop TAT.
So avoid any and all calls of OpenCV-native functions from the Python side ( behind the bindings' latency extra-miles ), no matter how tempting or sweet these may look at first sight. HUNDREDS of [ms] are a rather bitter cost of ignoring this advice.
Python ?
Yes, Python. Using the Python interpreter introduces latency per se, plus adds problems with concurrency-avoided processing, no matter how many cores our hardware operates on ( while recent Py3 releases try hard to lower these costs inside the interpreter-level software ).
We can test & squeeze the max out of the ( still unavoidable, in 2022 ) GIL-lock interleaving: check sys.getswitchinterval() and test increasing this amount for less interleaved python-side processing ( the tweaking depends on your other python-application ambitions: GUI, distributed-computing tasks, python network-I/O workloads, python HW-I/O-s, if applicable, etc. ).
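A minimal sketch of that probe, assuming the frame-grabbing loop is the only latency-critical thread in the process ( the 0.05 [s] value is an arbitrary starting point to experiment with, not a recommendation ):

import sys

# inspect the current GIL thread-switch interval ( CPython default ~ 0.005 [s] )
print( sys.getswitchinterval() )

# experimentally raise it, so the frame-grabbing thread gets longer uninterrupted
# slices of the interpreter; re-measure the end-to-end latency after each change
sys.setswitchinterval( 0.05 )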
RAM-memory-I/O costs ?
Our next major enemy. Using the least-sufficient image-DATA-format that MediaPipe can still work with is the way forward in this segment.
Avoidable losses
All other ( our ) sins belong to this segment. Avoid any image-DATA-format transformations ( see above, the cost may easily grow into hundreds of thousands of [us] just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap ).
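As a rough illustration of how such a "small" conversion adds up inside a control-loop, here is a hedged micro-benchmark sketch ( the 400x600 frame and the BGR-to-RGB conversion are stand-ins for whatever transformation your pipeline happens to perform ):

import time
import cv2
import numpy as np

frame = np.zeros( ( 400, 600, 3 ), dtype = np.uint8 )    # stand-in for an acquired BGR frame

t0 = time.perf_counter_ns()
rgb = cv2.cvtColor( frame, cv2.COLOR_BGR2RGB )            # "just another colourmap"
t1 = time.perf_counter_ns()

print( ( t1 - t0 ) / 1e3, "[us] spent on a single, avoidable conversion" )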
MediaPipe lists the enumerated formats it can work with:
// ImageFormat
SRGB: sRGB, interleaved: one byte for R,
then one byte for G,
then one byte for B for each pixel.
SRGBA: sRGBA, interleaved: one byte for R,
one byte for G,
one byte for B,
one byte for alpha or unused.
SBGRA: sBGRA, interleaved: one byte for B,
one byte for G,
one byte for R,
one byte for alpha or unused.
GRAY8: Grayscale, one byte per pixel.
GRAY16: Grayscale, one uint16 per pixel.
SRGB48: sRGB, interleaved, each component is a uint16.
SRGBA64: sRGBA, interleaved, each component is a uint16.
VEC32F1: One float per pixel.
VEC32F2: Two floats per pixel.
So, choose the MVF -- the minimum viable format -- for the gesture-recognition to work and downscale the amount of pixels as much as possible ( a 400x600-GRAY8 would be my hot candidate ).
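If your MediaPipe version exposes the Tasks-API mp.Image wrapper and the concrete task accepts grayscale input at all ( both are assumptions to verify against the solution you actually use ), wrapping the raw frame would look roughly like this:

import mediapipe as mp
import numpy as np

gray = np.empty( ( 400, 600 ), dtype = np.uint8 )   # raw GRAY8 frame, as acquired

# wrap the buffer for MediaPipe without any colourmap conversion
# ( some versions may expect an explicit ( H, W, 1 ) channel axis )
mp_image = mp.Image( image_format = mp.ImageFormat.GRAY8, data = gray )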
Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than just plain storing this MVF in a RAW format on the native side of the Acquisition-&-Pre-processing chain, so that no other post-process formatting takes place.
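A minimal configuration sketch along these lines ( the 'YUYV' FOURCC and the property values are assumptions; which pixel formats the driver actually honours is camera-dependent, so verify each .set() with a matching .get() ):

import cv2

cap = cv2.VideoCapture( 0 )

cap.set( cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc( *'YUYV' ) )   # ask for a raw pixel format
cap.set( cv2.CAP_PROP_CONVERT_RGB, 0 )                              # skip the implicit BGR conversion
cap.set( cv2.CAP_PROP_FRAME_WIDTH,  600 )
cap.set( cv2.CAP_PROP_FRAME_HEIGHT, 400 )

print( cap.get( cv2.CAP_PROP_FOURCC ), cap.get( cv2.CAP_PROP_CONVERT_RGB ) )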
If indeed forced to ever touch the python-side numpy.array object, prefer to use vectorised & striding-tricks powered operations over .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs that would further increase the control-loop TAT.
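A tiny sketch of the zero-copy flavour of such an operation ( the halving-by-slicing step is just an illustrative stand-in for whatever strided access your logic needs ):

import numpy as np

frame = np.empty( ( 400, 600 ), dtype = np.uint8 )   # hot, already-stored GRAY8 frame

half = frame[ ::2, ::2 ]                              # a strided view: no copy, no new RAM-allocation
print( half.shape, half.base is frame )               # ( 200, 300 ) True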
Eliminate any python-side OpenCV calls ( as these cost you ~2x the data-I/O + transformation costs ) by precisely configuring the native-side OpenCV processing to match the needed MediaPipe data-format.
Minimise, better yet avoid, any blocking; if the control-loop is still too skewed, try distributed-processing, moving the raw data into another process ( not necessarily a Python interpreter ) on localhost or within a sub-ms LAN domain ( further tips available here ).
Try to fit the hot-DATA RAM-footprints to match your CPU cache-hierarchy's cache-line sizing & associativity details ( see this ).