Tags: c#, speech-to-text, azure-cognitive-services

Microsoft.CognitiveServices.Speech.DetailedSpeechRecognitionResultCollection error


We're experimenting with speech-to-text using Microsoft Cognitive Services. One of our requirements is to have word level timestamps. This works fine with short wav files, say, 2-3 minutes of audio, but with larger files we're getting an error: "There was an error deserializing the object of type Microsoft.CognitiveServices.Speech.DetailedSpeechRecognitionResultCollection. The value '2152200000' cannot be parsed as the type 'Int32'."

Any and all hints as to how I can get around this would be greatly appreciated. Thanks in advance!

Code snippet:

    config.OutputFormat = OutputFormat.Detailed;
    config.RequestWordLevelTimestamps();

    using (var audioInput = AudioConfig.FromWavFileInput(wavfile))
    {
        using var recognizer = new SpeechRecognizer(config, audioInput);

        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                var framesStart = TimeSpan.FromTicks(e.Result.OffsetInTicks).TotalMilliseconds / 40; // 100 ns ticks -> 25 fps frames (40 ms/frame)
                var te = new TranscriptElement((long)framesStart, e.Result.Text, languageCode);
                // Eventually fails on the following line:
                var words = e.Result.Best().OrderByDescending(x => x.Confidence).First().Words;
                foreach (var w in words.OrderBy(w => w.Offset))
                {
                    var start = TimeSpan.FromTicks(w.Offset).TotalMilliseconds / 40;
                    var duration = TimeSpan.FromTicks(w.Duration).TotalMilliseconds / 40;
                    te.SingleWords.Add(new TranscriptSingleWord((long)start, (long)(start + duration), w.Word));
                }

                transcriptElements.Add(te);
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                _logger.LogError("NOMATCH: Speech could not be recognized.");
            }
        };
        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

        // Blocks until recognition is signaled to stop (stopRecognition setup sketched below).
        Task.WaitAny(new[] { stopRecognition.Task });

        await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
    }
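
(`stopRecognition` isn't shown above; it presumably follows the standard continuous-recognition sample pattern, i.e. a `TaskCompletionSource` completed from the session events. A minimal sketch, wired up before `StartContinuousRecognitionAsync`:)

    var stopRecognition = new TaskCompletionSource<int>();

    recognizer.Canceled += (s, e) =>
    {
        _logger.LogError($"CANCELED: Reason={e.Reason}");
        stopRecognition.TrySetResult(0);
    };

    recognizer.SessionStopped += (s, e) =>
    {
        // End of the audio stream (or an explicit stop) ends the wait above.
        stopRecognition.TrySetResult(0);
    };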

Solution

  • It's a bug in the data type the Best() extension uses for the offset: the 100-nanosecond tick values are deserialized as Int32, which overflows past Int32.MaxValue ticks ≈ 214.7 seconds of audio. The failing value 2152200000 ticks is ~215.2 seconds, just over that limit.

    Until a fix is available, you can read the raw JSON that the Best() method parses from the result's property collection, via the SpeechServiceResponse_JsonResult property, and deserialize it yourself with 64-bit offsets.
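
    A minimal sketch of that workaround, placed inside the Recognized handler. Reading the JSON through PropertyId.SpeechServiceResponse_JsonResult is the route described above; the field names (NBest, Confidence, Words, Word, Offset, Duration, all offsets in 100 ns ticks) match the service's detailed result format, but verify them against your own payloads:

        // Bypass Best() and parse the detailed JSON directly, reading
        // word offsets as Int64 so long files don't overflow.
        var json = e.Result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);

        using var doc = System.Text.Json.JsonDocument.Parse(json);
        var best = doc.RootElement.GetProperty("NBest")
            .EnumerateArray()
            .OrderByDescending(n => n.GetProperty("Confidence").GetDouble())
            .First();

        foreach (var w in best.GetProperty("Words").EnumerateArray())
        {
            long offsetTicks = w.GetProperty("Offset").GetInt64();     // 100 ns ticks
            long durationTicks = w.GetProperty("Duration").GetInt64(); // 100 ns ticks
            var word = w.GetProperty("Word").GetString();

            var start = TimeSpan.FromTicks(offsetTicks).TotalMilliseconds / 40;
            var duration = TimeSpan.FromTicks(durationTicks).TotalMilliseconds / 40;
            te.SingleWords.Add(new TranscriptSingleWord((long)start, (long)(start + duration), word));
        }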