I have a simple API stream generator in FastAPI that streams audio chunks (WAV) on the fly, because processing the whole file up front is too expensive. However, playback doesn't start right away unless some larger chunks (more than 5-10 s) have been streamed or the connection is closed by stopping the server.
<audio
  ref={audioRef}
  controls={true}
  autoPlay={true}
  onCanPlayThrough={() => {
    console.log("Can play through.");
    setIsLoaded(true);
  }}
  onError={(err) => {
    console.log("Error loading.");
    console.error(err);
  }}
  onLoadedData={(event) => {
    console.log("Loaded data.");
    console.log(event);
  }}
  onLoadedMetadata={(event) => {
    console.log("Loaded metadata.");
    console.log(event);
  }}
  onWaiting={(event) => {
    console.log("Audio is waiting for more data.");
    console.log(event);
  }}
>
  <source id="source" src="/audio" type="audio/wav" />
  Your browser does not support the audio element.
</audio>
This is the response when the first chunk is yielded:
HTTP/1.1 200 OK
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: *
Access-Control-Allow-Headers: *
date: Fri, 05 Jan 2024 20:04:48 GMT
server: uvicorn
content-type: audio/wave
connection: close
transfer-encoding: chunked
Moreover, no event fires on the tag's HTMLMediaElement unless, as described above, some larger chunks are streamed or the connection is closed by stopping the server; only then does playback actually start.
I also tested with:
curl -m 15 -o output.wav http://localhost:5000/audio
which works as expected.
Are there any limitations on how much buffering is required before autoPlay kicks in, or is there something specific that needs to be done? The chunks streamed are 3-4 s long, and of course the first chunk contains the metadata needed for decoding. I would like the first chunk to be flushed instantly and playback to begin immediately. Since that's not happening, I suspect I'll have to rely on some lower-level interface to manually play each chunk as it arrives for fully real-time playback of contiguous audio data.
Yeah, the browser is going to buffer until it's confident it can play the stream without dropouts. This might be a bit dated, but in Chromium there is/used to be:
Fixed-Size Buffer for Sniffing -- Until roughly 8 KB of data had been received, the browser wouldn't even try to decode, because it needed to 'sniff' the content type and required some data to do that. If you're sending straight PCM you'll fill that buffer quickly, but there's another potential problem here: the browser doesn't know the codec you're sending. WAVE files can contain a number of codecs, and even though it's almost always LPCM, it could be something else, so maybe you're hitting an issue with that (there's a quick header-check sketch just after these two points).
'Network' Buffering -- This is the more practical buffer, there to ensure a smooth stream. The faster you fill this buffer, the sooner your playback will begin. If, on connect, your server sits around and doesn't send any data right away, the browser may decide it needs to buffer longer. You mention that you're sending 3-4 second chunks... that's probably your actual issue. If you send a few seconds of audio but then send nothing for another 3-4 seconds, the browser doesn't know it can expect that data until it actually receives it.
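To rule out the codec question from the first point: in a canonical 44-byte RIFF header the 2-byte format code sits at offset 20. A quick sketch, assuming you've fetched at least the header (files with extra chunks before 'fmt ' would need real parsing):

function wavFormatCode(header: ArrayBuffer): number {
  const view = new DataView(header);
  // Canonical layout: "RIFF" at byte 0, "WAVE" at 8, "fmt " at 12;
  // the 2-byte audio-format code sits at offset 20, little-endian.
  return view.getUint16(20, true); // 1 = LPCM, 3 = IEEE float, ...
}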
You should pre-buffer where possible, flush that data to the client immediately on connect, and keep sending data as you get it, in a streaming fashion rather than in discrete chunks. You can always recover the latency added by the pre-buffering later on by increasing the playback rate by a fraction until the live play head catches up to the buffered data. Just note that the browser is buffering for a reason... you don't want to break playback for users on high-latency connections, so don't push it too far.
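A minimal sketch of the flush-as-you-go idea. Your generator lives in FastAPI, but since Express shows up in your response headers as a front layer, here's the shape of the pattern in Node/Express terms; wavHeader() and audioFrames() are hypothetical stand-ins for your actual source:

import express from "express";

declare function wavHeader(): Buffer;                  // hypothetical: 44-byte LPCM RIFF header
declare function audioFrames(): AsyncIterable<Buffer>; // hypothetical: yields PCM as it's produced

const app = express();

app.get("/audio", async (_req, res) => {
  res.setHeader("Content-Type", "audio/wav");
  // Send the header immediately so the browser can sniff the container/codec.
  res.write(wavHeader());
  // Write each frame the moment it exists instead of batching 3-4 s of audio;
  // with chunked transfer-encoding every write goes out right away.
  for await (const frame of audioFrames()) {
    res.write(frame);
  }
  res.end();
});

app.listen(5000);

And the catch-up trick is just a playbackRate nudge. A client-side sketch, where the 0.75 s threshold and the 5% speed-up are arbitrary knobs you'd tune:

const audio = document.querySelector("audio")!;
setInterval(() => {
  if (audio.paused || audio.buffered.length === 0) return;
  const bufferedEnd = audio.buffered.end(audio.buffered.length - 1);
  const lag = bufferedEnd - audio.currentTime;
  // Play ~5% fast while more than 0.75 s behind the buffer, else normal speed.
  audio.playbackRate = lag > 0.75 ? 1.05 : 1.0;
}, 1000);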
I suspect I'll have to rely on some lower-level interface to manually play each chunk as it arrives for fully real-time playback of contiguous audio data
Nah, the browser essentially does this for you. In any case, you definitely do not want to try to play chunks one by one manually; you'll never get them aligned. They minimally need to be buffered and played as one contiguous stream. Even if you could align them, power-saving features would kick in and drop your playback because nothing was playing for a bit here and there. Worst case, you can use MediaSource Extensions (with some codecs/containers) to get more control, but you don't really need that for this use case.
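For reference, the MediaSource route looks roughly like this. Note that browsers generally won't accept audio/wav in a SourceBuffer, so this sketch assumes you could emit an MSE-friendly codec such as MP3:

const audio = document.querySelector("audio")!;
const mediaSource = new MediaSource();
audio.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const sourceBuffer = mediaSource.addSourceBuffer("audio/mpeg");
  const reader = (await fetch("/audio")).body!.getReader();
  // Appends are async and only one may be in flight at a time.
  const whenIdle = () =>
    sourceBuffer.updating
      ? new Promise((resolve) =>
          sourceBuffer.addEventListener("updateend", resolve, { once: true }))
      : Promise.resolve();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    await whenIdle();
    sourceBuffer.appendBuffer(value);
  }
  await whenIdle();
  mediaSource.endOfStream();
});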