I'm trying to do speech-to-text on some WAV files using the Microsoft Cognitive Services Speech SDK. It works well enough for some files, but for others it gets stuck. By stuck, I mean that recognition doesn't stop until it's cancelled manually.
I tried first with the RecognizeOnceAsync method:
private static void processRecording()
{
    var speechConfig = SpeechConfig.FromSubscription("mykey", "myregion");
    speechConfig.SpeechRecognitionLanguage = "es-MX";
    speechConfig.OutputFormat = OutputFormat.Detailed;
    using (var audioStream = new PushAudioInputStream())
    {
        audioStream.Write(File.ReadAllBytes("myfilepath"));
        using (var audioConfig = AudioConfig.FromStreamInput(audioStream))
        {
            using (var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig))
            {
                var result = speechRecognizer.RecognizeOnceAsync().Result;
                switch (result.Reason)
                {
                    case ResultReason.RecognizedSpeech:
                        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
                        break;
                    case ResultReason.NoMatch:
                        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
                        break;
                    case ResultReason.Canceled:
                        var cancellation = CancellationDetails.FromResult(result);
                        Console.WriteLine($"CANCELED: Reason={cancellation.Reason}, ErrorCode={cancellation.ErrorCode}, ErrorDetails={cancellation.ErrorDetails}");
                        break;
                }
            }
        }
    }
}
And with this I get (after over a minute):
CANCELED: Reason=Error, ErrorCode=ServiceTimeout, ErrorDetails=Timeout: no recognition result received SessionId: 322853a3085d41ec9b60ee940531038c
I then tried with StartContinuousRecognitionAsync:
private async static Task processRecordingsAsync()
{
    var speechConfig = SpeechConfig.FromSubscription("mykey", "myregion");
    speechConfig.SpeechRecognitionLanguage = "es-MX";
    speechConfig.OutputFormat = OutputFormat.Detailed;
    var waiter = new System.Threading.ManualResetEvent(false);
    var audioStream = new PushAudioInputStream();
    audioStream.Write(File.ReadAllBytes("myfilepath"));
    var audioConfig = AudioConfig.FromStreamInput(audioStream);
    var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
    Action cleanup = () =>
    {
        waiter.Set();
        try { speechRecognizer.Dispose(); } catch { }
        try { audioConfig.Dispose(); } catch { }
        try { audioStream.Dispose(); } catch { }
    };
    speechRecognizer.Recognizing += (sender, e) => Console.WriteLine($"Recognizing: {e.Result.Text}");
    speechRecognizer.SessionStarted += (sender, e) => Console.WriteLine($"Recognize session started");
    speechRecognizer.SessionStopped += (sender, e) => Console.WriteLine($"Recognize session stopped");
    speechRecognizer.SpeechEndDetected += (sender, e) => Console.WriteLine($"Speech end detected");
    speechRecognizer.SpeechStartDetected += (sender, e) => Console.WriteLine($"Speech start detected");
    speechRecognizer.Recognized += (sender, e) =>
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"Recognized text: {e.Result.Text}");
        }
        else
        {
            Console.WriteLine($"Could not recognize text: {e.Result.Reason}");
        }
        cleanup();
    };
    speechRecognizer.Canceled += (sender, e) =>
    {
        Console.WriteLine($"Error trying to recognize text: Reason = {e.Reason}, ErrorCode = {e.ErrorCode}, ErrorDetails = {e.ErrorDetails}");
        cleanup();
    };
    await speechRecognizer.StartContinuousRecognitionAsync();
    if (!waiter.WaitOne(60000))
    {
        await speechRecognizer.StopContinuousRecognitionAsync();
    }
}
And with this I get:
Recognize session started
Speech start detected
Recognizing: con el
Recognizing: con el servicio de tele
Recognizing: con el servicio de tele terapia
Recognizing: con el servicio de tele terapia de
Recognizing: con el servicio de tele terapia de tercer
Recognize session stopped
Error trying to recognize text: Reason = Error, ErrorCode = ServiceTimeout, ErrorDetails = Timeout while waiting for service to stop SessionId: e289298cf97447b89bd088a665e6c095
So it's getting through about 90% of the file (which is about 4 seconds long), but it gets stuck and doesn't end until I force it with StopContinuousRecognitionAsync.
When I try this file in Speech Studio, it recognizes almost exactly the same thing, but it does not get stuck.
Note that I am using a free subscription. Could it be because of that? Is there anything else I could try?
The reason you're seeing this is that the audio input stream being used is still patiently "waiting" for the possibility of more data being pushed to it. The stream has no way of knowing that this is the complete contents of a file versus, say, an ongoing forwarding of a real-time input stream that just got blocked for a few seconds. If the end of the stream doesn't have enough trailing silence tacked onto it, that hypothetical future data could even influence the final recognition result you'd receive, which is why you see the end of the file not yet being recognized (it's not finalized).
Two likely fixes:

1. Call .Close() on the PushAudioInputStream, or write an empty buffer (.Write(new byte[0])), to explicitly mark the end of the stream and allow the SDK to wrap things up without waiting for more data.
2. Use AudioConfig.FromWavFileInput instead, which avoids needing to do any of these steps yourself.

Just as one additional note: I wouldn't recommend invoking .Dispose on these SDK objects from within a callback (event) originating from the same objects. It can lead to some funniness if there are still pending callbacks awaiting dispatch once the callback that's calling Dispose finishes. If more prompt disposal is needed than what IDisposable will provide via using statements, doing it on either the main thread (e.g. by awaiting a TaskCompletionSource signaled on completion) or a new task (Task.Run(() => cleanup())) will avoid any potential concurrency issues with teardown and eventing.
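Putting those suggestions together, here's a rough sketch of how the continuous-recognition version could look, with the stream closed after the final write and with disposal deferred to the main thread via a TaskCompletionSource. The key/region and file path placeholders are carried over from your code, not real values:

```csharp
private static async Task ProcessRecordingAsync()
{
    var speechConfig = SpeechConfig.FromSubscription("mykey", "myregion");
    speechConfig.SpeechRecognitionLanguage = "es-MX";
    speechConfig.OutputFormat = OutputFormat.Detailed;

    var done = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    using (var audioStream = new PushAudioInputStream())
    using (var audioConfig = AudioConfig.FromStreamInput(audioStream))
    using (var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig))
    {
        // Push the whole file, then close the stream so the SDK knows
        // no more audio is coming and can finalize the last phrase.
        audioStream.Write(File.ReadAllBytes("myfilepath"));
        audioStream.Close();

        speechRecognizer.Recognized += (sender, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine($"Recognized text: {e.Result.Text}");
        };
        // Signal completion from the callbacks instead of disposing in them.
        speechRecognizer.SessionStopped += (sender, e) => done.TrySetResult(true);
        speechRecognizer.Canceled += (sender, e) =>
        {
            Console.WriteLine($"Canceled: Reason={e.Reason}, ErrorDetails={e.ErrorDetails}");
            done.TrySetResult(true);
        };

        await speechRecognizer.StartContinuousRecognitionAsync();
        await done.Task; // wait on the calling thread, not inside an event handler
        await speechRecognizer.StopContinuousRecognitionAsync();
    } // the using blocks dispose everything here, after eventing is finished
}
```

With the stream closed, the session should stop on its own shortly after the last audio is processed, so the SessionStopped signal replaces the 60-second timeout entirely.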