Tags: c#, audio, audio-streaming, naudio, wasapi

NAudio's BufferedWaveProvider gets full when recording and mixing audio


I'm having an issue with a BufferedWaveProvider from the NAudio library. I'm recording 2 audio devices (a microphone and a speaker), merging them into 1 stream and sending it to an encoder (for a video).

To do this, I do the following:

  1. Create a thread where I record the microphone using WasapiCapture.
  2. Create a thread where I record the speakers' audio using WasapiLoopbackCapture. (I also play a SilenceProvider so there are no gaps in what I record.)
  3. Since I want to mix these 2 streams, I have to make sure they have the same format, so I detect the best WaveFormat among the audio devices. In my scenario, it's the speaker's. So the microphone audio passes through a MediaFoundationResampler to convert it to the same format as the speaker's (see the sketch after this list).
  4. Each audio chunk from the Wasapi(Loopback)Capture is sent to a BufferedWaveProvider.
  5. I also create a MixingSampleProvider to which I pass the ISampleProvider of each recording thread: the MediaFoundationResampler for the microphone, and the BufferedWaveProvider for the speakers.
  6. In a loop in a third thread, I read the data from the MixingSampleProvider, which is supposed to asynchronously empty the BufferedWaveProviders while they are being filled.
  7. Because the buffers may not fill at exactly the same rate, I look for the minimal common buffered duration between the 2 buffers, and I read that amount from the mixing sample provider.
  8. I then enqueue what I read so that my encoder, in a 4th thread, can process it in parallel too.
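
To illustrate step 3, here is a minimal sketch of how the target format can be picked and the resampler wired up (the variable names here are simplified, not the ones from my real code):

WaveFormat micFormat = micCapture.WaveFormat;         // WasapiCapture
WaveFormat speakerFormat = speakerCapture.WaveFormat; // WasapiLoopbackCapture

// "Best" format selection, simplified here to the highest sample rate.
// In my scenario, the speaker's format wins.
WaveFormat targetWaveFormat = speakerFormat.SampleRate >= micFormat.SampleRate
    ? speakerFormat
    : micFormat;

// The microphone doesn't match the target format, so its BufferedWaveProvider
// goes through a MediaFoundationResampler; the speakers' buffer is used as-is.
IWaveProvider micProvider =
    micFormat.SampleRate == targetWaveFormat.SampleRate
    && micFormat.Channels == targetWaveFormat.Channels
    && micFormat.BitsPerSample == targetWaveFormat.BitsPerSample
        ? (IWaveProvider)micBuffer // the BufferedWaveProvider filled by DataAvailable
        : new MediaFoundationResampler(micBuffer, targetWaveFormat);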

Please see the flowchart below that illustrates the description above.

[Flowchart of the recording and mixing pipeline described above]

My problem is the following:

  • It works GREAT when recording the microphone and speakers for more than 1 hour while playing a video game that also uses the microphone (for online multiplayer). No crash. The buffers stay almost empty the whole time. It's awesome.
  • But for some reason, every time I try my app during a Discord, Skype or Teams audio conversation, it crashes almost immediately (within 5 seconds) in BufferedWaveProvider.AddSamples because the buffer gets full.

Looking at it in debug mode, I can see that:

  • The buffer corresponding to the speakers is almost empty. It holds at most about 100 ms of audio on average.
  • The buffer corresponding to the microphone (the one I resample) is full (5 seconds).

From what I read on the NAudio author's blog, in the documentation and on StackOverflow, I believe I'm following best practice (but I may be wrong): writing into the buffer from one thread and reading it in parallel from another one. There is of course a risk that it fills faster than I read it, and that's basically what's happening right now. But I don't understand why.

Help needed

I'd like some help to understand what I'm missing here, please. The following points are confusing me:

  1. Why does this issue happen only with Discord/Skype/Teams meetings? The video games I play use the microphone too, so I can't imagine it's something like another app preventing the microphone/speakers from working correctly.

  2. I synchronize the startup of both audio recorders. To do this, I use a signal to ask the recorders to start, and once they have all started to generate data (through the DataAvailable event), I send a signal telling them to fill the buffers with what they receive in the next events. It's probably not perfect because the two audio devices raise their DataAvailable at different times, but we're talking about 60 ms of difference at most (on my machine), not 5 seconds. So I don't understand why the buffer fills up.
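
For reference, that handshake is essentially this (a simplified sketch; the field names are not the real ones):

// Each recorder signals once, on its first DataAvailable event.
private static int _recordersStarted;
private const int RecorderCount = 2; // microphone + loopback

private bool _thisRecorderSignaled;

private void NotifyRecorderIsReady()
{
    if (!_thisRecorderSignaled)
    {
        _thisRecorderSignaled = true;
        Interlocked.Increment(ref _recordersStarted);
    }
}

// Chunks are only buffered once every device has produced at least one chunk.
private bool AllRecorderAreReady => Volatile.Read(ref _recordersStarted) == RecorderCount;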

  3. To bounce on what I said in #2, my telemetry shows the buffer filling up this way (the values are illustrative):

Microphone buffered duration: 0ms | Speakers: 0ms
Microphone buffered duration: 60ms | Speakers: 60ms
Microphone buffered duration: 0ms | Speakers: 0ms <= That's because I read the data from the mixing sample provider
Microphone buffered duration: 60ms | Speakers: 0ms <= Events may not be in sync, that's ok.
Microphone buffered duration: 120ms | Speakers: 60ms <= Alright, next loop, I'll extract 60ms on each buffer.
Microphone buffered duration: 390ms | Speakers: 0ms <= Wait, how?
Microphone buffered duration: 390ms | Speakers: 60ms
[...]
Microphone buffered duration: 5000ms | Speakers: 0ms <= Oh no :(

So it appears that the microphone's buffer fills faster... But why? Could it be that the resampler slows down the reads from the microphone's buffer? If so, shouldn't it also slow down the reads from the speakers' buffer, since I read both through the MixingSampleProvider?

Here is a simplified extract of my code, in case that helps:


/* THREAD #1 AND #2 */

_audioCapturer = new WasapiCapture(_device); // Or WasapiLoopbackCapture + a SilenceProvider playing
_audioCapturer.DataAvailable += AudioCapturer_DataAvailable;

// This buffer can hold up to 5 seconds of audio; beyond that, AddSamples throws.
// So we should make sure we never store more than this amount.
_waveBuffer = new BufferedWaveProvider(_audioCapturer.WaveFormat)
{
    DiscardOnBufferOverflow = false,
    ReadFully = false
};

if (DoINeedToResample)
{
    // Create a resampler to adapt the audio to the desired wave format.
    // In my scenario explained above, this happens for the Microphone.
    _resampler = new MediaFoundationResampler(_waveBuffer, targetWaveFormat);
}
else
{
    // No conversion is required.
    // In my scenario explained above, this happens for the Speakers.
    _resampler = _waveBuffer;
}

private void AudioCapturer_DataAvailable(object? sender, WaveInEventArgs e)
{
    NotifyRecorderIsReady();
    if (!AllRecorderAreReady)
    {
        // Don't buffer the chunk unless all the other recorders have started recording too.
        return;
    }

    // Add the captured samples to the wave buffer.
    _waveBuffer.AddSamples(e.Buffer, 0, e.BytesRecorded);

    // Notify the "mixer" that a chunk has been recorded.
}

/* The Mixer, in another class */


// MixingSampleProvider takes ISampleProviders, so each IWaveProvider is converted.
_waveProvider = new MixingSampleProvider(_allAudioRecorders.Select(r => r._resampler.ToSampleProvider()));
_allAudioRecorders.ForEach(r => r._audioCapturer.StartRecording());

Task _mixingTask = Task.CompletedTask;

private void OnChunkAddedToBufferedWaveProvider()
{
    // IsCompleted covers RanToCompletion, Faulted and Canceled.
    if (_mixingTask.IsCompleted)
    {
        // Process the buffered audio in parallel.
        _mixingTask = Task.Run(() =>
        {
            /* THREAD #3 */
            lock (_lockObject)
            {
                TimeSpan minimalBufferedDuration;
                do
                {
                    // Get the common duration of audio that all the recorders captured.
                    minimalBufferedDuration = _allAudioRecorders.Min(r => r._waveBuffer.BufferedDuration);

                    if (minimalBufferedDuration.Ticks > 0)
                    {
                        // Read that duration from the mixer. MixingSampleProvider is an
                        // ISampleProvider, so we read float samples, not bytes.
                        var format = _waveProvider!.WaveFormat;
                        var sampleCount = (int)(minimalBufferedDuration.TotalSeconds * format.SampleRate) * format.Channels;
                        var data = new float[sampleCount];
                        var readData = _waveProvider.Read(data, 0, data.Length);

                        // Send the data to a queue that the encoder processes in parallel.
                    }
                } while (minimalBufferedDuration.Ticks > 0);
            }
        });
    }
}

Does anyone have an idea of what I'm doing wrong and/or why this reproduces only when chatting by voice on Discord/Skype/Teams and not in online multiplayer games?

Thanks in advance!

[UPDATE] 2/9/2021

I may have found the issue, but I'm not 100% sure how to handle it. It seems like I stop receiving data from the microphone, and therefore the speaker buffer gets full. (It seems like yesterday it was the opposite.)

[UPDATE] 2/12/2021

It seems that, for some reason, maybe (and I say maybe because the issue could be something else), the BufferedWaveProvider doesn't empty itself after a read in some scenarios.

What makes me think of that is the following:

  1. Before reading the MixingSampleProvider, I log how much buffered duration each buffer holds (see the sketch after the logs below).
  2. And I log it after reading too.
  3. Most of the time it's great: I get consistent data showing the following pattern for dozens of minutes, or even an hour:
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 10ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 0ms

// I can't explain why both buffers are empty considering my algorithm was supposed to read only 10 ms, but the output MP4 seems fine and in sync, so it's fine? ...
  4. And then suddenly, one of the buffers gets filled within 5 seconds:
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 10ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 0ms
BEFORE READING MICROPHONE: 10ms
BEFORE READING SPEAKER: 20ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 20ms
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 30ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 30ms
BEFORE READING MICROPHONE: 10ms
BEFORE READING SPEAKER: 50ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 50ms
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 70ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 70ms
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 80ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 80ms
BEFORE READING MICROPHONE: 10ms
BEFORE READING SPEAKER: 100ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 100ms
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 110ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 110ms
BEFORE READING MICROPHONE: 10ms
BEFORE READING SPEAKER: 130ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 130ms
[...]
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 4970ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 4970ms
BEFORE READING MICROPHONE: 10ms
BEFORE READING SPEAKER: 4980ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 4980ms
BEFORE READING MICROPHONE: 20ms
BEFORE READING SPEAKER: 5000ms
AFTER READING MICROPHONE: 0ms
AFTER READING SPEAKER: 5000ms
<= Crash
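
For reference, the telemetry above comes from something like this around the mixer read (a simplified sketch; the Name property is not in my real code):

void LogBuffers(string when)
{
    foreach (var recorder in _allAudioRecorders)
    {
        Console.WriteLine($"{when} READING {recorder.Name}: " +
            $"{(int)recorder._waveBuffer.BufferedDuration.TotalMilliseconds}ms");
    }
}

LogBuffers("BEFORE");
var readData = _waveProvider.Read(data, 0, data.Length);
LogBuffers("AFTER");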

I could do a dirty fix by clearing the buffer when it starts getting out of sync, but I'd really like to understand why this happens and whether there is a better way to work around it.
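
For completeness, that dirty fix would look something like this (ClearBuffer is an existing BufferedWaveProvider method; the threshold is arbitrary):

// Drop the backlog of any buffer that drifts too far behind the others.
// This loses audio and can hurt A/V sync, so it's a last resort.
var maxDrift = TimeSpan.FromSeconds(1); // arbitrary threshold
foreach (var recorder in _allAudioRecorders)
{
    if (recorder._waveBuffer.BufferedDuration > maxDrift)
    {
        recorder._waveBuffer.ClearBuffer();
    }
}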

Thank you

[UPDATE] #2

OK, I think I isolated the issue. This may be a bug in the NAudio library. Here is what I did:

  1. Run my program as usual.
  2. When one of the buffers reaches 5 seconds (i.e. gets full), stop filling that specific buffer.
  3. By doing this, I end up in a situation where one device's buffer keeps being filled while the other device's is not, but I keep reading both buffers whenever I can.
  4. And here is what I found out: the buffered size of the buffer that got full never decreases after reading, which explains why it suddenly fills up. It is unfortunately inconsistent, and I can't explain why.

Solution

  • Following further investigation and a post on GitHub (https://github.com/naudio/NAudio/issues/742):

    I found out that I should listen to the MixingSampleProvider.MixerInputEnded event and re-add the SampleProvider to the MixingSampleProvider when it fires.

    The reason it happens is that I'm processing the audio while capturing it, and at some moments I may process it faster than I record it. The MixingSampleProvider then considers that it has nothing more to read and drops that input. So I have to tell it that it's not over and that it should expect more data.
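
    Here is a minimal sketch of that fix, wired onto the _waveProvider from the code above:

    // When a buffer is momentarily empty, its Read returns 0 (ReadFully = false
    // on the BufferedWaveProvider), so the MixingSampleProvider removes that
    // input and raises MixerInputEnded. Re-adding the input keeps the mix
    // alive once new captured data arrives.
    _waveProvider.MixerInputEnded += (sender, args) =>
    {
        _waveProvider.AddMixerInput(args.SampleProvider);
    };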