We are currently evaluating the Bing Speech Recognition Service in a live streaming scenario. We receive a live stream of PCM-encoded audio (16 kHz sample rate, 16-bit, mono) and are trying to send it to the Bing Speech Recognition service.
We have successfully used the DataRecognitionClient from https://www.nuget.org/packages/Microsoft.ProjectOxford.SpeechRecognition-x64/ with our scenario by sending the audio format prior to streaming the audio itself, like so:
_dataRecognitionClient.SendAudioFormat(SpeechAudioFormat.create16BitPCMFormat(16000));
We are then streaming the audio stream in a loop like so:
_dataRecognitionClient.SendAudio(buffer, bytesRead);
This works fine. However, we assume that the ProjectOxford library might get deprecated, since the official Bing Speech Recognition website (https://www.microsoft.com/cognitive-services/en-us/Speech-api/documentation/GetStarted/GetStartedCSharpServiceLibrary) points to a different NuGet package, see: https://www.nuget.org/packages/Microsoft.Bing.Speech/
When we use the SpeechClient from this package, we get an "Audio format could not be parsed" error when calling RecognizeAsync on the SpeechClient.
var speechInput = new SpeechInput(
    producerConsumerStream,
    new RequestMetadata(
        Guid.NewGuid(),
        new DeviceMetadata(DeviceType.Near, DeviceFamily.Desktop, NetworkType.Ethernet, OsName.Windows, "Azure", "Microsoft", "Current"),
        new ApplicationMetadata("App", "1.0"),
        "Speech"));
await _speechClient.RecognizeAsync(speechInput, new CancellationToken());
The last line throws the error. We assume this is because our PCM stream has no WAVE/RIFF header, since it is a live stream rather than a file. For the streaming scenario, the DataRecognitionClient provided the SendAudioFormat method to declare the format up front.
Does SpeechClient not support a streaming scenario?
Answering my own question: we solved the issue by prepending a WAVE header to the stream, with a fake value for the total number of samples (aka length), since the real length of a live stream is not known in advance. See: Create valid wav file header for streams in memory
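For anyone hitting the same error, here is a minimal sketch of how such a header could be built. It is not the exact code we used: the class name WavHeader, the method Create, and the placeholder length of int.MaxValue - 36 are illustrative assumptions, and whether the service accepts this particular placeholder is untested here. It writes the canonical 44-byte RIFF/WAVE header for uncompressed PCM, defaulting to the 16 kHz, 16-bit, mono format from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavHeader
{
    // Builds a canonical 44-byte RIFF/WAVE header for uncompressed PCM.
    // dataLength is a placeholder (assumption): a live stream has no known length.
    public static byte[] Create(int sampleRate = 16000, short bitsPerSample = 16,
                                short channels = 1, int dataLength = int.MaxValue - 36)
    {
        short blockAlign = (short)(channels * bitsPerSample / 8); // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                   // bytes per second

        using (var ms = new MemoryStream(44))
        using (var w = new BinaryWriter(ms, Encoding.ASCII))      // BinaryWriter is little-endian
        {
            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + dataLength);                 // RIFF chunk size = 36 + data size
            w.Write(Encoding.ASCII.GetBytes("WAVE"));
            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                              // size of the fmt chunk for PCM
            w.Write((short)1);                        // audio format 1 = PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(byteRate);
            w.Write(blockAlign);
            w.Write(bitsPerSample);
            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(dataLength);                      // fake data chunk size
            w.Flush();
            return ms.ToArray();
        }
    }
}
```

The returned bytes would be written to the stream once, before the first PCM buffer, so that the stream handed to SpeechInput starts with the header and the raw audio follows it.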