
How to concatenate/join audio buffer arrays (text-to-speech results) into one in Node.js?


I want to convert many texts into one audio file, but I'm not sure how to concatenate multiple audio results into a single file. (A long text can't be converted in one request because of the 5K characters per request limit.)

My current code is below. It generates multiple audio byte arrays, but it fails to merge the MP3 audio, apparently because it ignores the header/metadata information. Would it be better to use LINEAR16 as the TTS audio encoding? Any suggestions are welcome. Thank you.

  const client = new textToSpeech.TextToSpeechClient();
  const promises = ['hi','world'].map(text => {
    const requestBody = {
      audioConfig: {
        audioEncoding: 'MP3'
      },
      input: {
        text: text,
      },
      voice: {
        languageCode: 'en-US',
        ssmlGender: 'NEUTRAL'
      },
    };
    return client.synthesizeSpeech(requestBody)
  })
  const responses = await Promise.all(promises)
  console.log(responses)
  const audioContents = responses.map(res => res[0].audioContent)
  const audioContent = audioContents.join() // this line has a problem

standard output

[
  [
    {
      audioContent: <Buffer ff f3 44 c4 00 12 a0 01 24 01 40 00 01 7c 06 43 fa 7f 80 38 46 63 fe 1f 00 33 3f c7 f0 03 03 33 1f c1 f0 0c eb fa 3f 03 20 7e 63 f3 78 03 ba 64 73 e0 ... 2638 more bytes>
    },
    null,
    null
  ],
  [
    {
      audioContent: <Buffer ff f3 44 c4 00 12 58 05 24 01 41 00 01 1e 02 23 9e 1f e0 1f 83 83 df ef 80 e8 ff 99 f0 0c 00 e8 7f c3 68 03 cf fd f8 8f ff 0f 3c 7f 88 f8 8c 87 e0 23 ... 2926 more bytes>
    },
    null,
    null
  ]
]

Solution

  • Workaround-1

    As I mentioned in the comment, there is a Node.js package for this requirement, google-tts-concat-ssml (installed below under the npm name google-text-to-speech-concat), which is not an official Google package. It automatically makes multiple requests to the API based on the 5K character limit and concatenates the resulting audio into a single audio file. Before executing the code, install the following client libraries:

    npm install @google-cloud/text-to-speech
    npm install google-text-to-speech-concat --save
    

    Try the code below, placing fewer than 5K characters between each pair of <p></p> tags. For example, 9K characters must be split across two or more requests, so put the first part (under 5K characters) between one pair of <p></p> tags and the remaining characters between a new pair of <p></p> tags. The google-text-to-speech-concat package then concatenates the audio returned by the API into a single audio file.

    const textToSpeech = require('@google-cloud/text-to-speech');
    const testSynthesize = require('google-text-to-speech-concat');
    const fs = require('fs');
    const path = require('path');
    (async () => {
     const request = {
       voice: {
         languageCode: 'en-US',
         ssmlGender: 'FEMALE'
       },
       input: {
         ssml: `
         <speak>
         <p>add less than 5k chars between paragraph tags</p>
         <p>add less than 5k chars between paragraph tags</p>
         </speak>`
       },
       audioConfig: {
         audioEncoding: 'MP3'
       }
     };
     try {
       // Create your Text To Speech client
       // More on that here: https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application
       const textToSpeechClient = new textToSpeech.TextToSpeechClient({
         keyFilename: path.join(__dirname, 'google-cloud-credentials.json')
       });
       // Synthesize the text, resulting in an audio buffer
       const buffer = await testSynthesize.synthesize(textToSpeechClient, request);
       // Handle the buffer
       // For example write it to a file or directly upload it to storage, like S3 or Google Cloud Storage
       const outputFile = path.join(__dirname, 'Output.mp3');
       // Write the file
       fs.writeFile(outputFile, buffer, 'binary', (err) => {
         if (err) throw err;
         console.log('Got audio!', outputFile);
       });
     } catch (err) {
       console.log(err);
     }
    })();
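
    The SSML string in the request above can also be built from an array of text chunks instead of being typed by hand. The following is only a minimal sketch: buildSsml, escapeXml and MAX_CHARS_PER_PARAGRAPH are hypothetical names that are not part of either package, and the limit value is simply the 5K figure mentioned in the question.

```javascript
// Hypothetical helpers (not part of any package): wrap each text chunk in its
// own <p> element so that every paragraph stays under the assumed limit.
const MAX_CHARS_PER_PARAGRAPH = 5000; // assumption: the 5K limit from the question

// Escape the characters that would otherwise break the SSML markup.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function buildSsml(paragraphs) {
  const body = paragraphs.map((paragraph, i) => {
    const escaped = escapeXml(paragraph);
    if (escaped.length >= MAX_CHARS_PER_PARAGRAPH) {
      throw new Error(`Paragraph ${i} has ${escaped.length} characters, which is not under ${MAX_CHARS_PER_PARAGRAPH}`);
    }
    return `<p>${escaped}</p>`;
  });
  return `<speak>${body.join('')}</speak>`;
}

console.log(buildSsml(['Tom & Jerry', 'second chunk']));
// prints: <speak><p>Tom &amp; Jerry</p><p>second chunk</p></speak>
```

    The returned string would then be passed as input.ssml in the request object.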
    
    

  • Workaround-2

    Try the code below to split the entire text into chunks of fewer than 5K characters and send each chunk to the API for conversion. This creates multiple audio files. Before executing the code, create a folder named result in your current working directory to store the output audio files.

    const textToSpeech = require('@google-cloud/text-to-speech');
    const fs = require('fs');
    const util = require('util');
     
    // Creates a client
    const client = new textToSpeech.TextToSpeechClient();
     
    (async function () {
     
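     // The text to synthesize, split into sentences at each period.
     // The optional "\.?" keeps trailing text that has no final period,
     // and "|| []" avoids a crash when nothing matches (e.g. an empty file).
     var text = fs.readFileSync('./text.txt', 'utf8');
     var newArr = text.match(/[^.]+\.?/g) || [];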
     
     var charCount = 0;
     var textChunk = "";
     var index = 0;
     
     for (var n = 0; n < newArr.length; n++) {
     
       charCount += newArr[n].length;
       textChunk = textChunk + newArr[n];
     
       console.log(charCount);
     
       if (charCount > 4600 || n == newArr.length - 1) {
     
         console.log(textChunk);
     
         // Construct the request
         const request = {
           input: {
             text: textChunk
           },
           // Select the language and SSML voice gender (optional)
           voice: {
             languageCode: 'en-US',
             ssmlGender: 'MALE',
             name: "en-US-Wavenet-B"
           },
           // select the type of audio encoding
           audioConfig: {
             effectsProfileId: [
               "headphone-class-device"
             ],
             pitch: -2,
             speakingRate: 1.1,
             audioEncoding: "MP3"
           },
         };
     
         // Performs the text-to-speech request
         const [response] = await client.synthesizeSpeech(request);
     
         console.log(response);
     
         // Write the binary audio content to a local file
         const writeFile = util.promisify(fs.writeFile);
         const outputPath = 'result/Output' + index + '.mp3';
         await writeFile(outputPath, response.audioContent, 'binary');
         console.log('Audio content written to file: ' + outputPath);
     
         index++;
     
         charCount = 0;
         textChunk = "";
       }
     }
    }());
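
    The chunking logic in the loop above can be pulled out into a pure function, which makes it easier to test without calling the API. This is a minimal sketch, not the code I tested: splitIntoChunks and maxChars are hypothetical names, and unlike the loop above it closes a chunk before the limit would be exceeded rather than after.

```javascript
// Hypothetical helper: groups sentences into chunks of at most maxChars.
// Caveat: a single sentence longer than maxChars is still returned as one
// oversized chunk, because this sketch only splits at periods.
function splitIntoChunks(text, maxChars = 4600) {
  // Sentences ending in a period, plus any trailing text without one.
  const sentences = text.match(/[^.]+\.?/g) || [];
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    // Close the current chunk if adding this sentence would exceed the limit.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current);
      current = '';
    }
    current += sentence;
  }

  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}

console.log(splitIntoChunks('One. Two. Three', 10));
// prints: [ 'One. Two.', ' Three' ]
```

    Each returned chunk would then be sent as input.text in its own synthesizeSpeech request.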
    

    To merge the output audio files into a single audio file, the audioconcat package can be used; it is not an official Google package. Other similar packages can also be used to concatenate the audio files.

    The audioconcat library requires that the ffmpeg application (not the ffmpeg npm package) is already installed, in a build with MP3 encoding support (libmp3lame). So, install the ffmpeg tool for your OS first, then install the following library before executing the code that concatenates the audio files:

    npm install audioconcat
    

    Try the code below. It concatenates all the audio files in your output directory and stores the single concatenated output.mp3 file in your current working directory.

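    const audioconcat = require('audioconcat');
    const fs = require('fs');
    const testFolder = 'result/';
    // Index embedded in a file name such as Output10.mp3
    const fileIndex = name => Number(name.match(/\d+/)[0]);
    // Keep only the generated files and order them numerically, so that
    // Output10.mp3 follows Output9.mp3 instead of Output1.mp3
    const array = fs.readdirSync(testFolder)
     .filter(name => /^Output\d+\.mp3$/.test(name))
     .sort((a, b) => fileIndex(a) - fileIndex(b))
     .map(name => testFolder + name);
    console.log(array);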
     
    audioconcat(array)
     .concat('output.mp3')
     .on('start', function (command) {
       console.log('ffmpeg process started:', command)
     })
     .on('error', function (err, stdout, stderr) {
       console.error('Error:', err)
       console.error('ffmpeg stderr:', stderr)
     })
     .on('end', function (output) {
       console.log('Audio successfully created', output)
     })
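
    As a side note on the line marked as a problem in the question: Array.prototype.join() converts every Buffer to a string and inserts commas, so the result is no longer binary audio. Buffer.concat() keeps the raw bytes. The sketch below uses a few dummy bytes instead of real API responses. Whether a byte-level concatenation of MP3 responses plays cleanly depends on the player, so treat it as an illustration rather than a replacement for ffmpeg. According to the Text-to-Speech documentation, LINEAR16 responses include a WAV header, so joining those byte for byte would leave extra headers inside the audio data.

```javascript
// Dummy stand-ins for two audioContent buffers (not real audio data).
const first = Buffer.from([0xff, 0xf3, 0x44, 0xc4]);
const second = Buffer.from([0xff, 0xf3, 0x44, 0xc4]);

// join() calls toString() on each Buffer and separates them with commas.
const joined = [first, second].join();

// Buffer.concat() returns a new Buffer holding the bytes of both inputs.
const merged = Buffer.concat([first, second]);

console.log(typeof joined);           // prints: string
console.log(Buffer.isBuffer(merged)); // prints: true
console.log(merged.length);           // prints: 8
```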
    

    For both workarounds, I tested code from various GitHub links and modified it to suit your requirement. Here are the links for your reference.

    1. Workaround-1
    2. Workaround-2