Tags: node.js, azure, text-to-speech, azure-cognitive-services, azure-speech

Latency Discrepancy in Text to Speech Synthesizer Between Local and Production Environments


I'm seeing a performance discrepancy in the Text to Speech synthesizer. When I call speakTextAsync locally, the 'first byte latency' and 'finish latency' are consistently below 200 ms for texts of varying lengths. When the same function runs in the production environment with the same text, the latency is 600-800 ms. What could explain this difference?

Code:

async getTextToSpeech(ctx, next){
        const requestBody = ctx.request.body;
        const speechKey = process.env.SPEECH_KEY;
        const speechRegion = process.env.SPEECH_REGION;

        let response;
        response = new RESPONSE_MESSAGE.GenericSuccessMessage();
        if (!speechKey || !speechRegion) {
            // Fail this request instead of terminating the whole server process.
            throw new Error('Please set the environment variables SPEECH_KEY and SPEECH_REGION');
        }

        const text = requestBody.speechText;

        const speechConfig = sdk.SpeechConfig.fromSubscription(speechKey, speechRegion);
        speechConfig.speechSynthesisVoiceName = 'hi-IN-MadhurNeural';
        speechConfig.speechSynthesisLanguage = 'hi-IN';
        speechConfig.speechSynthesisOutputFormat = sdk.SpeechSynthesisOutputFormat.Riff8Khz16BitMonoPcm;

        const pullStream = sdk.AudioOutputStream.createPullStream();
        const audioConfig = sdk.AudioConfig.fromStreamOutput(pullStream);
        const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

        const outputFilePath = 'tts_output.wav';
        const outputFileStream = fs.createWriteStream(outputFilePath);
        outputFileStream.on("error", err => console.log(err));

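        await new Promise((resolve, reject) => {
            synthesizer.speakTextAsync(text, (result) => {
                synthesizer.close(); // release the connection once the result arrives
                if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
                    // end() flushes the buffer; resolve only after the file is fully written.
                    outputFileStream.end(Buffer.from(result.audioData), () => {
                        console.log(`TTS audio saved to: ${outputFilePath}`);
                        resolve();
                    });
                } else {
                    outputFileStream.end();
                    reject(new Error(`Speech synthesis failed: ${result.errorDetails}`));
                }
            }, (err) => {
                // The SDK reports transport-level failures through this second callback.
                synthesizer.close();
                outputFileStream.end();
                reject(new Error(`Speech synthesis failed: ${err}`));
            });
        });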
        return outputFilePath;
    }

I ran speakTextAsync in both environments with the same text. I expected production to match the local 'first byte latency' and 'finish latency' (below 200 ms), but production shows 600-800 ms instead.
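
To compare the two environments on equal footing, it can help to time the call the same way in both places and look at several samples rather than one. The following is a minimal sketch, not part of the original code: `timeAsync` and `summarize` are hypothetical helper names, and the sketch only measures wall-clock time around a promise, so it includes network and file-write time as well as the service's own latency.

```javascript
// Hypothetical helpers (not from the original post) for timing an async call.
const { performance } = require('node:perf_hooks');

// Runs fn(), awaits its promise, and reports the elapsed wall-clock time in ms.
// Failures are captured too, so slow errors still show up in the numbers.
async function timeAsync(label, fn) {
    const start = performance.now();
    try {
        const value = await fn();
        return { label, ok: true, value, elapsedMs: performance.now() - start };
    } catch (error) {
        return { label, ok: false, error, elapsedMs: performance.now() - start };
    }
}

// Summarizes a list of timings so one slow request does not dominate the picture.
function summarize(samplesMs) {
    const sorted = [...samplesMs].sort((a, b) => a - b);
    const pick = (q) => sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
    return {
        count: sorted.length,
        min: sorted[0],
        median: pick(0.5),
        p95: pick(0.95),
        max: sorted[sorted.length - 1],
    };
}
```

Wrapping the `new Promise(...)` around speakTextAsync in `timeAsync` and running it a number of times in each environment gives comparable figures. Looking at the first sample separately from the rest may show whether one-time setup accounts for part of the gap.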


Solution

    • The code below synthesizes speech from text in an Express app using the Microsoft Azure Cognitive Services Speech SDK.
    • Refer to this link for guidance on the latency discrepancy you are seeing between your local and production environments for Text to Speech synthesis.
    
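    // Excerpt: assumes sdk (microsoft-cognitiveservices-speech-sdk), express, app,
    // speechConfig, audioFile, and port are defined earlier in the file.
    
    // Serve the HTML form for input
    app.get('/', (req, res) => {
        res.send(`
            <form action="/synthesize" method="post">
                <label for="text">Enter some text that you want to speak:</label><br>
                <input type="text" id="text" name="text"><br>
                <button type="submit">Submit</button>
            </form>
        `);
    });
    
    // Handle text input and synthesis
    app.post('/synthesize', express.urlencoded({ extended: true }), (req, res) => {
        const text = req.body.text;
    
        // Build the audio config and synthesizer per request: the synthesizer is closed
        // after each synthesis below and a closed synthesizer cannot be reused.
        // Writing to audioFile via fromAudioFileOutput is an assumption; the original
        // setup is not shown.
        const audioConfig = sdk.AudioConfig.fromAudioFileOutput(audioFile);
        const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
    
        // Start the synthesizer and wait for a result.
        synthesizer.speakTextAsync(text,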
            function (result) {
                if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
                    console.log("synthesis finished.");
                    res.send(`
                        <p>Speech synthesis completed. Check your console for details.</p>
                        <p>Output audio file: <a href="/audio">${audioFile}</a></p>
                    `);
                } else {
                    console.error("Speech synthesis canceled, " + result.errorDetails +
                        "\nDid you set the speech resource key and region values?");
                    res.status(500).send('Speech synthesis failed.');
                }
                synthesizer.close();
            },
            function (err) {
                console.trace("err - " + err);
                res.status(500).send('Speech synthesis failed.');
                synthesizer.close();
            });
        console.log("Now synthesizing to: " + audioFile);
    });
    
    // Serve the synthesized audio file
    app.get('/audio', (req, res) => {
        res.sendFile(__dirname + '/' + audioFile);
    });
    
    app.listen(port, () => {
        console.log(`Speech synthesis app listening at http://localhost:${port}`);
    });
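
The sample above closes its synthesizer after each synthesis. If measurements suggest that per-request setup is part of the latency, one option that may be worth trying is to build the synthesizer once and reuse it across requests. The sketch below shows only the reuse pattern; `createReusable` is a hypothetical helper, and the factory you pass in (for example `() => new sdk.SpeechSynthesizer(speechConfig, audioConfig)`) is an assumption about how it would be wired into this app.

```javascript
// Hypothetical helper (not part of the Speech SDK): lazily builds an expensive
// client once and hands the same instance to every caller.
function createReusable(factory) {
    let instance = null;
    return {
        // Returns the cached instance, creating it on first use.
        get() {
            if (instance === null) {
                instance = factory();
            }
            return instance;
        },
        // Drops the cached instance (closing it if it exposes close()), for example
        // after a failed request, so that the next get() builds a fresh one.
        reset() {
            if (instance !== null && typeof instance.close === 'function') {
                instance.close();
            }
            instance = null;
        },
    };
}
```

With this pattern the request handler would call `get()` instead of constructing a synthesizer, and would call `reset()` only on failure rather than closing after every request. Whether this actually narrows the gap between local and production is something to confirm by measuring.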
    
    
    
    

    Screenshots (not reproduced here): local run, deployment status, and the app running on Azure.