I need to distribute a video stream from a live source to several clients with the additional requirement that each frame is identifiable across all clients.
I have already done some research on the topic and arrived at a possible solution that I can share. My solution seems suboptimal, and this is my first time working with video streams, so I want to see if somebody knows a better way.
The reason I need to identify specific frames within the video stream is that the streaming clients need to be able to talk about the time differences between events that each of them identifies in their respective video streams.
I want to enable the following interaction:
Dewey cannot send the frame to Stevie directly, because Malcolm and Reese also want to tell him about specific video frames and Stevie is interested in the time difference between their findings.
The solution I found was to use ffserver to broadcast an RTP stream and use the timestamps from the RTCP packets to identify frames. These timestamps are normally used to synchronize audio and video, not to provide a shared timeline across several clients, which is why I am skeptical that this is the best way to solve my problem.
It also seems beneficial to have frame numbers, i.e. an increasing counter of frames, instead of arbitrary timestamps that increase by some possibly varying offset: my application also has to reference neighboring frames, and it seems easier to compute time differences from frame numbers than the other way around.
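To illustrate the last point: assuming a fixed, known frame rate (the 25 fps below is made up), turning frame numbers into time differences is a one-line calculation, whereas going from raw timestamps back to "the n-th frame after this one" is less direct.

```python
# Purely illustrative: with a shared, increasing frame counter and a known,
# constant frame rate, the time difference between two events is a division.
FRAME_RATE = 25.0  # frames per second; assumed constant for this sketch

def time_difference_seconds(frame_a, frame_b):
    """Seconds between two events identified by their shared frame numbers."""
    return (frame_b - frame_a) / FRAME_RATE

print(time_difference_seconds(1200, 1275))  # 75 frames apart -> 3.0 s at 25 fps
```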
We ended up not carrying the project through to completion, and sadly I can't provide any source code, but we conceptually developed two solutions that might be useful for others solving the same problem.
The first is a minimum-effort solution that achieves the desired goal, while the second is a more flexible design that leverages RTCP to support various video formats.
quick and dirty
You start from an existing implementation of an MJPEG stream (or a similarly simple video codec with self-contained frames) that you have the source code for, and put a lossless transport layer underneath that format (like TCP or HTTP).
1) You add a single function to your video codec implementation that produces a hash, e.g. SHA-1, from the image data of a frame.
2) You add a (persistent) Map to your server implementation, let's call it framemap, that takes your hashes as keys and returns an integer as its value.
3) When you encode your video to your output format on the server, compute the hash of every frame and put it into the framemap with an incrementing integer that identifies the frame.
4) You add an additional API to your server through which a client can send you a hash; you look it up in the framemap and return the corresponding frame number.
5) On the client, when you want to know the frame number of a frame, you compute its hash, ask the server API about that hash, and it sends back the corresponding frame number.
In this design you only add the hashing functionality somewhere in the video codec, and tack everything else on in another place.
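As a rough Python sketch of what the five steps amount to (the names framemap, register_frame, lookup_frame_number and ask_server are mine, not taken from any existing implementation): the server fills the framemap while encoding and answers hash lookups, and the client hashes a frame and asks. For the hashes to match, both sides have to hash exactly the same bytes, which is what the lossless transport and the self-contained frames are for.

```python
import hashlib

# --- server side ---------------------------------------------------------
framemap = {}            # hash of a frame's data -> increasing frame number
next_frame_number = 0    # counter that provides the shared frame numbers

def frame_hash(frame_data):
    """Step 1: produce a hash (SHA-1 here, as suggested above) of a frame's data."""
    return hashlib.sha1(frame_data).hexdigest()

def register_frame(frame_data):
    """Steps 2 and 3: called while encoding; stores the hash under the next frame number."""
    global next_frame_number
    framemap[frame_hash(frame_data)] = next_frame_number
    next_frame_number += 1

def lookup_frame_number(hash_from_client):
    """Step 4: the extra server API; returns the frame number for a hash, or None if unknown."""
    return framemap.get(hash_from_client)

# --- client side ----------------------------------------------------------
def identify_frame(frame_data, ask_server):
    """Step 5: hash the received frame and ask the server; ask_server stands in for the API call."""
    return ask_server(frame_hash(frame_data))
```

In practice the framemap would have to be persisted or bounded, since it grows by one entry per frame.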
clean design
This relies on the RTP protocol and its RTCP control stream. Each RTP packet has a timestamp that signifies the intended presentation time of the contained frame, but its start value is random, so you need to look at the RTCP control stream, which gives you an NTP timestamp of the server together with the corresponding presentation timestamp. From this you should be able to compute fairly precise timestamps, all based on the NTP clock of the server. We tried to add functionality supporting this to VLC, which turned out to be fairly hard, because VLC has a fairly complicated codebase that pulls together a lot of code from different places. So you may want to extend a simpler implementation instead, depending on your requirements.
Take a look at RFC 2326 – Chapter 3.6 Normal Play Time and Chapter 3.7 Absolute Play Time for this approach.
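To make the arithmetic concrete, here is a small Python sketch of the timestamp mapping (not of the VLC integration; the parameter names are mine). It assumes you can read each frame's RTP timestamp and, from the most recent RTCP sender report, the NTP timestamp and RTP timestamp that describe the same instant; the clock rate comes from the SDP, with 90 kHz being typical for video.

```python
RTP_CLOCK_RATE = 90_000  # from the SDP; 90 kHz is the usual clock rate for video

def frame_wall_clock(rtp_timestamp, sr_ntp_seconds, sr_rtp_timestamp):
    """Map a frame's RTP timestamp onto the server's NTP timeline.

    sr_ntp_seconds and sr_rtp_timestamp come from the latest RTCP sender report
    and refer to the same instant. Real code also has to handle the 32-bit
    wraparound of RTP timestamps, which this sketch ignores.
    """
    return sr_ntp_seconds + (rtp_timestamp - sr_rtp_timestamp) / RTP_CLOCK_RATE
```

Because every client ends up with values expressed in the server's clock, two clients can compare their results directly; the difference between the values they computed for their respective frames is exactly the time difference Stevie is after.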