Tags: ffmpeg, h.264, hardware-acceleration, libav, motion-detection

How to extract motion vectors from h264 without a full decode on the CPU


I'm trying to use my nose as a pointing device. The plan is to encode the video stream from a webcam pointed at my face as h264 or the like, get the motion vectors, cook the numbers a bit and chuck them into /dev/uinput to make the mouse pointer move about. The uinput bit was easy.
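The "cook the numbers a bit" step might look roughly like the sketch below, written in Python for brevity. It assumes the motion vectors have already been extracted as (dx, dy) pairs in pixels, one per block, with positive dx meaning movement to the right; the sign and mirroring may well need flipping for a real camera. The name pointer_delta and the gain and dead_zone parameters are made up for illustration and are not part of ffmpeg or uinput.

```python
from statistics import median


def pointer_delta(vectors, gain=2.0, dead_zone=0.5):
    """Reduce per-block motion vectors to one (dx, dy) pointer move.

    vectors: iterable of (dx, dy) pairs in pixels (assumed already extracted).
    gain, dead_zone: hypothetical tuning knobs, not taken from any library.
    """
    # Blocks with zero motion are mostly static background, so drop them.
    moving = [(dx, dy) for dx, dy in vectors if dx or dy]
    if not moving:
        return (0, 0)
    # The per-axis median is less sensitive to stray vectors than the mean.
    mx = median(dx for dx, _ in moving)
    my = median(dy for _, dy in moving)
    # Ignore movement below the dead zone to suppress jitter.
    if abs(mx) < dead_zone and abs(my) < dead_zone:
        return (0, 0)
    return (round(mx * gain), round(my * gain))
```

The resulting pair would be what gets written to uinput as relative X and Y events, once per frame.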

This has to work with zero discernible latency. This, for instance:

#!/bin/bash
[ -p pipe.mkv ] || mkfifo pipe.mkv
ffmpeg -y -rtbufsize 1M -s 640x360 -vcodec mjpeg -i /dev/video0 -c h264_nvenc pipe.mkv &
ffplay -flags2 +export_mvs -vf codecview=mv=pf+bf+bb pipe.mkv

shows that the vectors are there, but with a latency of several seconds, which is unusable for a mouse. I know that the first ffmpeg step is working very fast by using the GPU, so either the pipe or the h264 decode in the second step is introducing the latency.

I tried MV Tractus (same as mpegflow I think) in a similar pipe arrangement and it was also very slow. They do a full h264 decode on the CPU and I think that's the problem cos I can see them imposing a lot of load on one CPU. If the pipe had caused the delay by buffering badly then the CPU wouldn't have been loaded. I guess ffplay also did the decoding on the CPU and I couldn't persuade it not to, but it only wants to draw arrows which are no use to me.

I think there are several approaches, and I'd like advice on which would be best, or if there's something even better I don't know about. I could:

  • Decode in hardware and get the motion vectors. So far this has failed. I tried combining ffmpeg's extract_mvs.c and hw_decode.c samples but no motion vectors turn up. vdpau is the only decoder I got working on my Linux box. I have an NVIDIA GPU.
  • Do a minimal parse of the h264 to fish out the motion vectors only, ignoring all the other data. I think this would mean putting some kind of "motion only" option in libav's parser, but I'm not at all familiar with that code.
  • Find some other h264 parsing library that has said option and also unpacks the container.
  • Forget about hardware accelerated encoding and use a stripped down encoder to make only the motion vectors on either CPU or GPU. I suspect this would be slow cos I think calculating the motion vectors is the hardest part of the algorithm.

I'm tending towards the second option but I need some help figuring out where in the libav code to do it.


Solution

  • Very interesting project! I'm no ffmpeg expert, but it looks to me like your ffmpeg command is decoding the mjpeg output of your /dev/video0 and then ENCODING it into h.264 to get the motion vectors. That h.264 encoding step is computationally intensive and is likely causing your latency. Some things you can do to speed it up are: (a) use a webcam that outputs h.264 instead of mjpeg; (b) run the h.264 encode on faster hardware; and (c) use ffmpeg to lower the resolution of your video stream before encoding it. For example, you could define a small "hot region" in the camera's field of view where the motions of your nose control the mouse.
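One way to try the "hot region" idea is to crop the frame before it reaches the encoder, using ffmpeg's crop filter, which takes the form crop=w:h:x:y. The Python sketch below only builds that filter string. The helper name crop_filter and the example region are hypothetical, and rounding the size down to even numbers is an assumption made because 4:2:0 encoders generally expect even dimensions.

```python
def crop_filter(frame_w, frame_h, cx, cy, w, h):
    """Build an ffmpeg crop filter string for a region centred on (cx, cy)."""
    # Keep the region no larger than the frame and round down to even sizes.
    w = min(w, frame_w) // 2 * 2
    h = min(h, frame_h) // 2 * 2
    # Clamp the top-left corner so the rectangle stays inside the frame.
    x = max(0, min(cx - w // 2, frame_w - w))
    y = max(0, min(cy - h // 2, frame_h - h))
    return f"crop={w}:{h}:{x}:{y}"


if __name__ == "__main__":
    # A 160x120 region in the middle of a 640x360 frame.
    print(crop_filter(640, 360, 320, 180, 160, 120))
```

The printed string would go after -vf on the encoding ffmpeg command, ahead of the codec option. Whether the smaller frame actually reduces the latency would have to be measured.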