Search code examples
node.jsaudioconcatenationbuffermp3

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?


As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generating correctly (individually, I can play them, edit them in my favourite software, etc.).

To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.

This is cool, but the issue arises when concatenating audio generated from different sources. The 4 sources I mentioned above are Text to Speech software platforms, such as Google Cloud Voice and Amazon Polly. Specifically, we have files from Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech. Essentially just five text to speech solutions. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.

When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:

|------------|--------|--------|-----------|-----|
| Platform / | Google | Amazon | Microsoft | IBM |
|------------|--------|--------|-----------|-----|
| Google     | Yes    | No     | No        | No  |
|------------|--------|--------|-----------|-----|
| Amazon     |        | No     | No        | Yes |
|------------|--------|--------|-----------|-----|
| Microsoft  |        |        | Yes       | No  |
|------------|--------|--------|-----------|-----|
| IBM        |        |        |           | Yes |
|------------|--------|--------|-----------|-----|

The effect is as follows: When I play the large output file, it will always start playing the first sound file included. From there, if the next sound file is compatible, it is heard, otherwise it is skipped entirely (no empty sound or anything). If it was skipped, the 'length' of that file (for example 10s long audio file) is included at the end of the generated output sound file. However, the moment that my audio player hits the point where the last 'compatible' audio has played, it immediately skips to the end.

As a scenario:

Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s)-> Google
sound4.mp3 (11s) -> IBM

Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.

In this case, the output sound file would be 26s seconds long. For the first 10 seconds, you would hear the sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least playing this mp3 file in firefox) the player immediately skips to the end at 26s.

My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :) I can provide my nodeJS backend code, but the process and methods used are described above.

Thanks for reading?


Solution

  • Potential Sources of Problems

    Sample Rate

    44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.

    While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.

    Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:

    ffmpeg -i sound2.mp3
    

    You'll get an output like this:

    Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
      Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
    

    In this example, 44.1 kHz is the sample rate.

    Channel Count

    I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As with above, check the output of FFmpeg. In my example above, it says stereo.

    As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.

    ID3 Tags

    It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.

    MP3 Bit Reservoir

    MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".

    If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.

    See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir

    Solutions

    Pick a "normal" format, resample and rencode non-conforming files

    If most of your sources are all the exact same format and only one or two outstanding, you could convert the non-conforming file. From there, strip ID3 tags from everything and concatenate away.

    To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.

    child_process.spawn('ffmpeg' [
      // Input
      '-i', inputFile, // Use '-' to write to STDIN instead
    
      // Set sample rate
      '-ar', '44100',
    
      // Set audio channel count
      '-ac', '1',
    
      // Audio bitrate... try to match others, but not as critical
      '-b:a', '64k',
    
      // Ensure we output an MP3
      '-f', 'mp3',
    
      // Output
      outputFile // As with input, use '-' to write to STDOUT
    ]);
    

    Best Solution: Let FFmpeg (or similar) do the work for you

    The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.

    This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once setup you won't have to worry about it.

    The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.