I am a software engineer at a company that uses TTS for telephony projects. When I place calls to test that our VUI, its corresponding functions, and its TTS prompts are working correctly, I often run into the following problem.
When I run tests (placing phone calls and navigating the VUI) in our local environment, prompts will randomly stop playing for a few seconds. Instead of the prompt there is silence, and then the prompt resumes at the point it would have reached had it kept playing through the gap.
For example, take the prompt: "Hello, thank you for calling today." At certain times, while testing in our local environment, I'll hear something like "Hello, [brief silence] calling today."
But when I run the exact same test in the environment we deploy to, I hear the prompt just as I'd expect. I know environment issues are common with TTS, specifically prompts cutting out and not playing clearly, but can anyone elaborate on what these "environment problems" could be? I do know that these aren't grammar issues. I'll run tests where the prompt is spoken perfectly, but when I then give a no-input or no-match response to hit the next function, which in that case plays essentially the same prompt, the cut-off / silence occurs.
Any information, sites or books are much appreciated. I personally haven't found anything online about this stuff. Thanks in advance!
TTS (text-to-speech) is an active process: the audio is synthesized at request time. Depending on how your platform implements TTS, the audio may be streamed directly from the TTS server. What may be happening is that the TTS engine can't keep up with the requests.
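To make the "can't keep up" idea concrete, here is a toy model in Python. It assumes (this is my simplification, not something I know about your platform) that the audio arrives in fixed-length chunks and that the player never waits: a chunk that shows up after its playback slot has started is heard as silence, and playback carries on with whatever is due next. The function name and the numbers are made up for illustration.

```python
from typing import List


def late_chunks(arrival_times: List[float], chunk_seconds: float) -> List[int]:
    """Return the indexes of chunks that arrive after their playback slot starts.

    arrival_times[i] is when chunk i reaches the player, in seconds.
    Playback starts when the first chunk arrives, and chunk i is due
    i * chunk_seconds after that. Late chunks are assumed to be heard
    as silence (hypothetical player behaviour).
    """
    if not arrival_times:
        return []
    start = arrival_times[0]
    late = []
    for index, arrival in enumerate(arrival_times):
        due = start + index * chunk_seconds
        if arrival > due:
            late.append(index)
    return late


# Six one-second chunks; the synthesizer stalls before chunk 3.
print(late_chunks([0.0, 0.2, 0.4, 3.5, 3.6, 3.8], 1.0))  # prints [3]
```

In this model only the slot for chunk 3 goes silent and everything after it plays on schedule, which is the same shape as "Hello, [silence] calling today."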
If this is on-premises (unlikely these days), monitor the performance of the TTS server(s); CPU is the best metric. If the platform uses MRCP (likely), the logs for that communication may provide insights.
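If you have shell access to the TTS host, a rough way to watch CPU pressure while you place test calls is to sample the load average. The sketch below is Python and rests on several assumptions: the host is Unix-like (`os.getloadavg` is not available on Windows), the 1-minute load average divided by the core count is a good enough stand-in for CPU utilization, and a threshold of 1.0 per core is an arbitrary starting point. The function names are hypothetical.

```python
import os
import time
from typing import Callable, List


def sample_load_per_core(
    samples: int,
    interval_seconds: float,
    read_load: Callable[[], float] = lambda: os.getloadavg()[0],
) -> List[float]:
    """Collect `samples` readings of the 1-minute load average per core."""
    cores = os.cpu_count() or 1
    readings = []
    for _ in range(samples):
        readings.append(read_load() / cores)
        time.sleep(interval_seconds)
    return readings


def overloaded(readings: List[float], threshold: float = 1.0) -> List[float]:
    """Return the readings above the (assumed) per-core threshold."""
    return [reading for reading in readings if reading > threshold]
```

Something like `overloaded(sample_load_per_core(30, 1.0))` run during a test call would show whether the gaps line up with a busy server. A proper monitoring tool is a better choice if one is already in place.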
If this is a hosted solution, contact your provider. Odds are their test environment is underprovisioned for TTS, mostly because everybody else in a test environment is leaning on TTS as well. In production, many apps switch to recorded audio for quality, so the demand on TTS resources is lower.
As an ugly hack, you could play a recording (an actual audio file) of 1 second of silence at the beginning of every form. This might give the TTS server enough time to get ahead and buffer the generated audio.
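If you don't have a silence file handy, one can be generated with Python's standard `wave` module. This sketch assumes 8 kHz, 16-bit, mono PCM, which is common for telephony; check what your platform actually accepts (some want mu-law or a different container). The function name and file name are placeholders.

```python
import wave


def write_silence_wav(path: str, seconds: float = 1.0, sample_rate: int = 8000) -> None:
    """Write `seconds` of silence as 16-bit mono PCM to `path`."""
    frame_count = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # In signed 16-bit PCM, a sample value of zero is silence.
        wav_file.writeframes(b"\x00\x00" * frame_count)
```

Calling `write_silence_wav("silence_1s.wav")` produces the file; you would then reference it as the first audio item in each form, ahead of the TTS prompt.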