Tags: c#, c++-cli, directshow

Media samples held in graph for a long time (accumulative effect)


Several months ago, I asked a question about buffer starvation in a DirectShow graph.

The starvation issue was solved by implementing a custom allocator that expands in size when starved. However, this merely mitigated the real problem: given enough time, the number of samples held in the graph becomes excessive and the ever-expanding pool leads to an out-of-memory situation.
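
For illustration only, here is a minimal sketch of the expand-when-starved idea. It is not the actual allocator: the `ExpandingPool` and `Buffer` names are hypothetical, the sketch is single-threaded, and a real implementation would sit behind DirectShow's IMemAllocator interface. The hard cap (`maxCount`) is an assumption added here so that starvation shows up as a failed request instead of unbounded growth.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a media sample buffer; not a DirectShow type.
struct Buffer {
    std::vector<unsigned char> data;
    explicit Buffer(std::size_t size) : data(size) {}
};

// Pool that grows when starved instead of blocking the caller.
class ExpandingPool {
public:
    ExpandingPool(std::size_t initialCount, std::size_t bufferSize,
                  std::size_t maxCount)
        : bufferSize_(bufferSize), maxCount_(maxCount), total_(0) {
        assert(initialCount <= maxCount);
        for (std::size_t i = 0; i < initialCount; ++i) {
            free_.push_back(std::make_unique<Buffer>(bufferSize_));
            ++total_;
        }
    }

    // Hands out a free buffer; allocates a new one if the pool is starved.
    // Returns nullptr once the cap is reached and nothing is free.
    std::unique_ptr<Buffer> acquire() {
        if (free_.empty()) {
            if (total_ >= maxCount_) return nullptr;
            ++total_;
            return std::make_unique<Buffer>(bufferSize_);
        }
        std::unique_ptr<Buffer> buffer = std::move(free_.back());
        free_.pop_back();
        return buffer;
    }

    // Returns a buffer to the pool once downstream is done with it.
    void release(std::unique_ptr<Buffer> buffer) {
        assert(buffer != nullptr);
        free_.push_back(std::move(buffer));
    }

    std::size_t total() const { return total_; }
    // Buffers currently held somewhere downstream.
    std::size_t outstanding() const { return total_ - free_.size(); }

private:
    std::size_t bufferSize_;
    std::size_t maxCount_;
    std::size_t total_;
    std::vector<std::unique_ptr<Buffer>> free_;
};
```

Logging `outstanding()` periodically is one way to see the accumulation described here long before memory runs out.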

Here are some facts I have managed to gather:

  1. The graph is basically transcoding an MPEG2-TS stream to an MP4 file, as well as extracting audio and video data for some realtime DSP processing.

  2. The stream arrives as a UDP multicast stream carrying 14 different SD programmes.

  3. I am reading the UDP stream using a custom filter derived from the DsNetwork example. Following that example, a media sample (with NO timestamps) is created around each received UDP data block (an 8 KiB block) and passed to Microsoft's MPEG2 Demultiplexer filter, which is configured to select the program of interest. (Should I be timestamping the samples?)

  4. The filter that requires an expandable allocator is the MPEG2 Demultiplexer; in particular, it is needed for the samples delivered by the video output pin. The audio output pin works fine with a default allocator: no samples are retained by the audio decoder or the demuxer.

  5. The video samples are being decoded by LAV Video Decoder. Swapping the LAV filter for the ffdshow filter has no positive effect - the accumulation is still present. I have found no setting in either LAV or ffdshow (including the sample queue settings) that alleviates the accumulation problem.

  6. The problem is directly related to the quality of the received stream. The more discontinuities detected in the stream (as flagged on the MPEG demuxer's output samples), the more samples tend to accumulate. Incidentally, a VLC player running in parallel and consuming the same stream logs the same discontinuities, so they don't seem to be induced by buggy network code on my part.

  7. The lingering samples are not lost, they are eventually processed by the graph. I wrote some watchdog logic to detect the possibility of lost samples and every sample is eventually properly released and returned to the pool.

  8. The lag is not related to CPU starvation. If I stop delivering samples to the demuxer, the demuxer stops delivering samples to the output pins. I NEED to push new samples into the demuxer for the lingering samples to be properly released and returned to the pool.

  9. I tried removing the clock from the capture graph, as well as from the muxer graphs (bridged by a GDCL bridge filter). This does not fix the problem and can actually block the data flow.

I have no idea whether the samples are being held by the demultiplexer or by the video decoder. The truth is that I am completely clueless about how to debug and hopefully fix this situation, and any pointers or suggestions are more than welcome.

Addendum:

I have some additional information:

  1. The transcoded video is lagging relative to the audio.
  2. The lag time is proportional to the amount of lingering samples.

So I think that at some point in the graph processing, the decoded audio and video sample timestamps get out of sync, and probably the muxer endpoint of the graph is blocking the video decoding thread, waiting for the corresponding audio to arrive.

Any hints on how I can detect the offending filter, or perhaps how I can "rebase" the syncing?

Addendum2:

As you can see in the comments on Roman's answer, I had actually found a bug that induced false discontinuities on the stream. By fixing that bug I reduced the number of incidences of the problem, yet I did not fix the root cause!

It turns out that the root cause was the Monogram AAC encoder filter (at least the version I managed to get, as it seems the project is no longer supported).

The encoder computes the output timestamps incrementally, from the number of samples received and the sampling frequency of the input. The filter assumes that the data flow is always continuous and does not even examine the incoming samples for discontinuities! Fixing it was easy once I identified the problem, but this was indeed the hardest problem I have had to debug in all my life as a developer, because all the symptoms pointed to the MPEG2 demuxer: the timestamps drifted between the encoded audio and video, and it was this filter that was running out of pooled samples in the first place. Yet this was caused indirectly by the worker thread of the video output pin being blocked at the end of the graph by the MPEG4 muxer, which was receiving audio and video samples that were far out of sync and was throttling the video input to try to keep things in sync.
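
The actual patch is not reproduced here. As a minimal sketch of the idea, assuming timestamps in 100 ns units (the DirectShow REFERENCE_TIME convention) and a hypothetical `AudioTimestamper` class, an encoder could keep counting frames while the data is continuous but re-anchor its output clock to the incoming timestamp whenever a sample is flagged as a discontinuity:

```cpp
#include <cassert>
#include <cstdint>

// 100-nanosecond units, matching DirectShow's REFERENCE_TIME convention.
using RefTime = std::int64_t;
constexpr RefTime kUnitsPerSecond = 10000000;

// Hypothetical helper, not part of any real encoder's API.
class AudioTimestamper {
public:
    explicit AudioTimestamper(std::int64_t sampleRate)
        : sampleRate_(sampleRate), base_(0), framesSinceBase_(0),
          started_(false) {
        assert(sampleRate > 0);
    }

    // Returns the output start time for a block of `frames` audio frames.
    //   inputStart    - start time stamped on the incoming sample
    //   discontinuity - upstream discontinuity flag for that sample
    RefTime stamp(RefTime inputStart, bool discontinuity, std::int64_t frames) {
        if (discontinuity || !started_) {
            // Re-anchor to the upstream clock instead of assuming that
            // the data flow has been continuous since the start.
            base_ = inputStart;
            framesSinceBase_ = 0;
            started_ = true;
        }
        const RefTime start =
            base_ + framesSinceBase_ * kUnitsPerSecond / sampleRate_;
        framesSinceBase_ += frames;
        return start;
    }

private:
    std::int64_t sampleRate_;
    RefTime base_;
    std::int64_t framesSinceBase_;
    bool started_;
};
```

Without the re-anchoring branch, every gap in the input would permanently shift the audio timestamps relative to the video, which is the kind of drift described above.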

Indeed, the illusion of the filters being "black boxes" needs to be taken with caution, as the threads flow along the graph, and a problem in a downstream filter may manifest as a false problem in an upstream filter.


Solution

  • First of all, the described behavior sounds like a bug, that is, unintended behavior causing unwanted effects. I agree, however, that any attempt to work around the problem requires identifying the offender and investigating the reported problem in detail.

    Since the video lags behind the audio by an amount that correlates with the number of lingering samples, and there is no other side effect (lost frames, for example), I agree that the challenge is to find exactly who holds the media samples.

    I can suggest two methods off the top of my head.

    Inspection of memory allocators

    This method is not so popular, for reasons I omit for brevity; however, it still has a good chance of working. The background is that pin connections involve negotiation of a memory allocator. The memory allocator is the private business of the pins, so in most cases the controlling application has no direct control over (or even access to) the data flow. Most often each pin pair has its own allocator; however, it is not so rare for multiple pin pairs to share the same allocator. Note that it is the output pin of a connection that has the final say on which allocator to use.

    If you happen to be familiar with my DirectShowSpy tool, one of the things it does is enumerate memory allocators.

    It can show the memory allocators, which connections share memory allocators and a snapshot of buffer count and free buffer count.

    For brevity, I omit the situations where this is inaccurate.

    Another important note is that this data is only available if you invoke the spy UI from the process where the DirectShow graph is running, as opposed to accessing the filter graph remotely via the Running Object Table.

    This means that you are supposed to do the following:

    1. register the spy
    2. have your application running (with the filter graph)
    3. from the controlling thread (typically), call IUnknown::QueryInterface for AlaxInfoDirectShowSpy::ISpy on your IGraphBuilder interface pointer
    4. call ISpy::DoPropertyFrameModal to show the UI in question

    You can obtain AlaxInfoDirectShowSpy::ISpy via #import of the spy's type library. If the spy is not registered via COM and does not hook the OS Filter Graph Manager object, your QueryInterface in step 3 above will fail.

    From C# code (since you tagged the question accordingly) you can import DirectShowSpy.dll as a COM reference.

    Even though this method is not guaranteed to work, it has a good chance of showing you the offender by visualizing the memory allocator states, and it requires only some 10 lines of code to be inserted into your application.

    Adding a temporary diagnostic filter to trace pin connection communication

    Another method, which has a better chance of working overall but requires writing quite a bit of code, is to develop a filter that transparently forwards data from its input pin to its output pin (for example, one based on CTransInPlaceFilter) and logs the media sample data to some shared output. You might want to reuse GraphStudioNext's analyzer filter for this purpose.

    The idea is to attach this filter as early as the demultiplexer output pins and monitor/log the data as it travels downstream from the filter. By comparing timestamps on the separate legs as the data is streamed, you should be able to detect the violator. If you see the lag when monitoring the demultiplexer output pin connections, then the demultiplexer is the offender. If things go reasonably well there, you would move the tracing downstream, especially past the decoders, and isolate the offender as you move the tracing filter.
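
    The bookkeeping behind such a comparison can be quite small. The sketch below is only an illustration of that bookkeeping, not a DirectShow filter: the `LegMonitor` class and the leg names are hypothetical, timestamps are assumed to be in 100 ns units, and a real tracing filter would call `observe` from its sample-forwarding path (with locking, since legs stream on different threads).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// 100-nanosecond units, matching DirectShow's REFERENCE_TIME convention.
using RefTime = std::int64_t;

// Hypothetical helper: remembers the latest sample start time per leg.
class LegMonitor {
public:
    // Record the start time of the most recent sample seen on a leg.
    void observe(const std::string& leg, RefTime start) {
        assert(!leg.empty());
        latest_[leg] = start;
    }

    // How far `leg` trails the most advanced leg seen so far.
    // Returns 0 for the leading leg and for legs never observed.
    RefTime lagBehindLeader(const std::string& leg) const {
        const auto it = latest_.find(leg);
        if (it == latest_.end()) return 0;
        RefTime leader = it->second;
        for (const auto& entry : latest_) {
            if (entry.second > leader) leader = entry.second;
        }
        return leader - it->second;
    }

private:
    std::map<std::string, RefTime> latest_;
};
```

    Placing one monitored point per leg at the same depth of the graph (for example, right after the demultiplexer, then right after the decoders) and watching where the lag first starts to grow narrows down the filter that is holding samples.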

    Possible workarounds

    Once the violator is identified, you will have to think of a way to trick it into releasing the media samples it holds, which might be a challenge in its own right. Having no other helpful information at this point, I would prepare to drain it on the go by sending an end-of-stream notification, by flushing, or by using dynamic media type negotiation, in order to eventually force it to drain its internal queue.