Tags: javascript, streaming, audio-streaming, web-audio-api

Strange micro-delays while playing queue of audioBufferSourceNodes


I'll start with the idea. I want to build a mechanism that loads an audio file in chunks of 6144 bytes each, and then plays all the chunks from the array where I store them.

When it's time to play the audioBufferSourceNodes, I get some strange delays between them, and I have no idea how to fix it.


I receive the chunks of the audio file from my WebSocket server, written in Python with Django Channels.

My vars:

const chatSocket = new WebSocket(...);
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var source;                                  // most recent AudioBufferSourceNode
var play = document.querySelector('#play');  // the play button
var audioQueue = [];                         // decoded chunks waiting to play

Next, when I receive a message with a chunk from the server, I decode it and put it in the queue:

chatSocket.onmessage = function(e) {
    e.data.arrayBuffer().then(buffer => {
        audioCtx.decodeAudioData(buffer, (x) => {
            // Wrap each decoded chunk in its own one-shot source node
            source = audioCtx.createBufferSource();
            source.buffer = x;
            source.connect(audioCtx.destination);
            audioQueue.push(source);
        });
    });
};

The last thing is to play all the chunks I received. To do that, I use this part:

play.onclick = function() {
    var whenStart = 0;
    for (let audioBufferSourceNode of audioQueue) {
        // Schedule each chunk to begin exactly when the previous one ends
        audioBufferSourceNode.start(whenStart);
        whenStart += audioBufferSourceNode.buffer.duration;
    }
}

That's all. The code above starts the audio, which is great, but as I wrote in the title, strange micro-delays between the chunks give me no peace.



Solution

  • Unfortunately, you can't stream this way. Lossy audio codecs don't always guarantee that they're going to start and end on a particular boundary. Even MP3 (which it looks like you're using) utilizes a bit reservoir, which spreads data for a particular frame into unused space in other frames. You can't use decodeAudioData this way unless you use a codec that guarantees sample accuracy. The only formats I know you can do this with in-browser are regular PCM and lossless codecs like FLAC.

    Assuming you do get sample-accurate decoding going, you still have a playback scheduling problem. The buffer's duration is reported in seconds, but a whole number of samples doesn't always map to a tidy value in seconds. A single sample at 44.1 kHz is 0.022675736961... milliseconds in duration. Without sample accuracy, you're not going to be able to time the chunks of playback correctly, and dropping samples here and there can be audible.
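
    To make that concrete, here's a quick illustration; the chunk length of 1536 samples is just an assumed value:

    // Illustration only: accumulate per-chunk durations in floating-point
    // seconds, the way the play loop above does, and compare to the exact total.
    const sampleRate = 44100;
    const chunkSamples = 1536;                   // assumed samples per chunk
    const duration = chunkSamples / sampleRate;  // what buffer.duration reports
    let whenStart = 0;
    for (let i = 0; i < 1000; i++) {
        whenStart += duration;
    }
    const exact = (1000 * chunkSamples) / sampleRate;
    // With IEEE-754 doubles these typically differ by a tiny amount. The
    // browser must round each start time to a sample frame, and any mismatch
    // with the decoder's actual output length becomes an audible gap or click.
    console.log(whenStart, exact);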

    So, what to do? There are several ways to solve the problem...

    Method 1: Use HTTP and an <audio> element

    The browser is perfectly capable of decoding audio data, buffering as appropriate, resampling, and doing everything else needed for streaming playback. It does all of this very efficiently. Rather than trying to reinvent that whole stack, let the browser deal with it.

    Set your Python server up to serve HTTP rather than WebSocket. Then, on your client, this is as simple as:

    <audio src="https://example.com/your-python-script/perhaps-some-stream-id" preload="none"></audio>
    

    In your MP3 example, all you have to do is send the MP3 data with the appropriate Content-Type header of audio/mpeg. You don't even have to start on a frame boundary... the MPEG stream is self-syncing.
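
    If you want to drive that element from the play button you already have, a minimal sketch (assuming the <audio> element above is on the page) looks like:

    // Sketch: start streaming playback on the user's click. Autoplay
    // policies generally require a user gesture like this anyway.
    const streamAudio = document.querySelector('audio');
    play.onclick = function() {
        streamAudio.play();  // the browser fetches, buffers, and decodes itself
    };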

    Method 2: Use MediaSource Extensions

    If you must send your data via WebSocket, you can still let the browser do the decoding and playback. There are a lot of restrictions on the codecs and container formats you can use, but it does fit your use case.

    There is no easy example for this, so I'll link you to some documentation: https://developer.mozilla.org/en-US/docs/Web/API/MediaSource
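
    That said, a rough sketch of the shape of the code, assuming the browser supports MP3 in Media Source Extensions (check MediaSource.isTypeSupported('audio/mpeg') first; if it's unsupported you'd have to remux into a supported container), might look like this:

    // Sketch only: feeds the existing WebSocket chunks into an <audio>
    // element via Media Source Extensions. Codec support is an assumption.
    const audio = new Audio();
    const mediaSource = new MediaSource();
    audio.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', () => {
        const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
        const pending = [];  // chunks that arrived while an append was in flight

        // appendBuffer() is asynchronous; only one append may run at a time.
        sourceBuffer.addEventListener('updateend', () => {
            if (pending.length > 0) sourceBuffer.appendBuffer(pending.shift());
        });

        chatSocket.onmessage = (e) => {
            e.data.arrayBuffer().then((chunk) => {
                if (sourceBuffer.updating || pending.length > 0) {
                    pending.push(chunk);
                } else {
                    sourceBuffer.appendBuffer(chunk);
                }
            });
        };
    });

    play.onclick = () => audio.play();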

    Method 3: Use [some decoder] and a ScriptProcessorNode or AudioWorkletNode

    A ScriptProcessorNode or AudioWorkletNode can be used to play back audio at the right time without dropping samples. If you can get your audio decoded to float32 PCM samples and resampled to the current graph's playback sample rate, you can have the script node output the next chunk in its audioprocess event handler (or, for an AudioWorkletNode, in its process() callback).
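
    For the AudioWorkletNode variant, here's a minimal sketch. The module file name pcm-player-processor.js and the processor name pcm-player are made up for illustration, and it assumes you already have mono float32 chunks at the context's sample rate:

    // pcm-player-processor.js — runs on the audio rendering thread.
    class PCMPlayerProcessor extends AudioWorkletProcessor {
        constructor() {
            super();
            this.chunks = [];  // Float32Array chunks waiting to be played
            this.offset = 0;   // read position inside chunks[0]
            this.port.onmessage = (e) => this.chunks.push(e.data);
        }
        process(inputs, outputs) {
            const out = outputs[0][0];  // one render quantum, mono
            for (let i = 0; i < out.length; i++) {
                if (this.chunks.length === 0) {
                    out[i] = 0;  // underrun: output silence instead of glitching
                } else {
                    out[i] = this.chunks[0][this.offset++];
                    if (this.offset === this.chunks[0].length) {
                        this.chunks.shift();
                        this.offset = 0;
                    }
                }
            }
            return true;  // keep the processor alive
        }
    }
    registerProcessor('pcm-player', PCMPlayerProcessor);

    // Main thread: load the module, create the node, and feed it chunks.
    audioCtx.audioWorklet.addModule('pcm-player-processor.js').then(() => {
        const playerNode = new AudioWorkletNode(audioCtx, 'pcm-player');
        playerNode.connect(audioCtx.destination);
        // for each decoded-and-resampled chunk:
        // playerNode.port.postMessage(float32Chunk);
    });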

    Decoding and resampling are not trivial to do client-side, and can take a good deal of CPU. Therefore, this method is not usually recommended.

    Beyond these methods, there is also WebRTC, which is more suitable for voice calls. Since you're streaming music, it's less recommended, as it trades quality for low latency.