Tags: c#, audio-streaming, streamreader, hololens, azure-speech

Reading WindowsMicrophoneStream for Azure Speech SDK on Hololens


I want to perform real-time speech recognition on the HoloLens 2 with Unity 2021, using the Microsoft Azure Cognitive Services Speech SDK. Instead of the default HoloLens 2 microphone stream, I want to switch to the stream category "room capture", for which I have to use the Windows Microphone Stream (see link). Initializing and starting the Windows Microphone Stream succeeds with this code:

    // Create the Windows Microphone Stream.
    micStream = new WindowsMicrophoneStream();
    if (micStream == null)
    {
        Debug.Log("Failed to create the Windows Microphone Stream object");
        return;
    }

    // Initialize the Windows Microphone Stream with the desired stream type
    // (in my case the room-capture category).
    WindowsMicrophoneStreamErrorCode result = micStream.Initialize(streamType);
    if (result != WindowsMicrophoneStreamErrorCode.Success)
    {
        Debug.Log($"Failed to initialize the microphone stream. {result}");
        return;
    }
    else Debug.Log($"Initialized the microphone stream. {result}");

    // Start the microphone stream.
    result = micStream.StartStream(true, false);
    if (result != WindowsMicrophoneStreamErrorCode.Success)
    {
        Debug.Log($"Failed to start the microphone stream. {result}");
    }
    else Debug.Log($"Started the microphone stream. {result}");

I don't have much knowledge of audio streams, but I assume that for the Speech SDK to receive the room capture, I have to feed it this mic stream. My problem is that I have not found a way to do that. I suspect I would have to implement my own PullAudioInputStreamCallback class (as e.g. here), but I don't know how Read() should be implemented for the Windows Microphone Stream. Additionally, I considered using a PushStream, like so:

        SpeechConfig speechConfig = SpeechConfig.FromSubscription(SpeechController.Instance.SpeechServiceAPIKey, SpeechController.Instance.SpeechServiceRegion);
        speechConfig.SpeechRecognitionLanguage = fromLanguage;
        using (var pushStream = AudioInputStream.CreatePushStream())
        {
            using (var audioInput = AudioConfig.FromStreamInput(pushStream))
            {
                using (var recognizer = new SpeechRecognizer(speechConfig, audioInput))
                {
                    recognizer.Recognizing += RecognizingHandler;
                    ...

                    await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

                    // The "MicStreamReader" is not implemented! 
                    using (MicStreamReader reader = new MicStreamReader(MicStream))
                    {
                        byte[] buffer = new byte[1000];
                        while (true)
                        {
                            var readSamples = reader.Read(buffer, (uint)buffer.Length);
                            if (readSamples == 0)
                            {
                                break;
                            }
                            pushStream.Write(buffer, readSamples);
                        }
                    }
                    pushStream.Close();
                }
            }
        }

But I would need something like a "MicStreamReader" for this code to work. Could you help me with this approach, or do you know a better one?


Solution

  • I would suggest the following steps:

    1. Use https://github.com/microsoft/MixedRealityToolkit-Unity/blob/htk_release/Assets/HoloToolkit-Examples/Input/Scripts/MicStreamDemo.cs as a base: create the MicStream with the desired stream category and then read the audio frames via MicStream.MicGetFrame in the OnAudioFilterRead callback method.

    2. Modify that sample so that it also creates the Speech SDK's SpeechRecognizer with a push audio stream configuration, and write every audio frame read in OnAudioFilterRead to the Speech SDK's push stream. Since MicStream.MicGetFrame returns the audio as floats, you need to convert the samples to 16-bit PCM before writing them to the SDK. For a float-to-PCM conversion example, see the following sample, which uses the Unity microphone to capture audio and writes it to the Speech SDK via a push stream: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/unity/from-unitymicrophone/Assets/Scripts/HelloWorld.cs. A rough sketch combining both steps is shown below.
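
    Putting the two steps together, a minimal sketch might look roughly like the following. It is untested: the MicStream identifiers (MicInitializeCustomRate, MicStartStream, MicGetFrame, StreamCategory.ROOM_CAPTURE) are taken from the HoloToolkit sample above and require its MicStreamSelector plugin, the subscription key/region values are placeholders, and the float-to-PCM conversion follows the idea of the linked HelloWorld.cs. Verify the names against the toolkit version you actually import.

        using System;
        using Microsoft.CognitiveServices.Speech;
        using Microsoft.CognitiveServices.Speech.Audio;
        using UnityEngine;

        // Sketch only: MicStream names come from the HoloToolkit MicStreamDemo sample
        // and may need adapting; key/region are placeholders.
        [RequireComponent(typeof(AudioSource))]
        public class RoomCaptureSpeechRecognizer : MonoBehaviour
        {
            public string SpeechServiceAPIKey = "YourSubscriptionKey";   // placeholder
            public string SpeechServiceRegion = "YourServiceRegion";     // placeholder

            private PushAudioInputStream pushStream;
            private SpeechRecognizer recognizer;

            private async void Start()
            {
                // Step 1: create and start the mic stream with the room-capture category,
                // as done in MicStreamDemo.cs (error checking omitted for brevity).
                MicStream.MicInitializeCustomRate((int)MicStream.StreamCategory.ROOM_CAPTURE,
                                                  AudioSettings.outputSampleRate);
                MicStream.MicStartStream(false, false);

                // Step 2: recognizer fed by a push stream. Declare the format we will push:
                // 16-bit PCM, mono, at Unity's output sample rate (downmixed below).
                var format = AudioStreamFormat.GetWaveFormatPCM(
                    (uint)AudioSettings.outputSampleRate, 16, 1);
                pushStream = AudioInputStream.CreatePushStream(format);

                var speechConfig = SpeechConfig.FromSubscription(SpeechServiceAPIKey, SpeechServiceRegion);
                var audioInput = AudioConfig.FromStreamInput(pushStream);
                recognizer = new SpeechRecognizer(speechConfig, audioInput);
                recognizer.Recognizing += (s, e) => Debug.Log($"RECOGNIZING: {e.Result.Text}");
                recognizer.Recognized += (s, e) => Debug.Log($"RECOGNIZED: {e.Result.Text}");

                await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
            }

            // Unity calls this on its audio thread; MicStream fills the buffer with the mic frame.
            private void OnAudioFilterRead(float[] buffer, int numChannels)
            {
                MicStream.MicGetFrame(buffer, buffer.Length, numChannels);
                if (pushStream == null) return;

                // Convert float samples [-1, 1] to 16-bit PCM, downmixing to mono by taking
                // the first channel of each frame (cf. the float->Int16 conversion in HelloWorld.cs).
                int frames = buffer.Length / numChannels;
                var pcm = new byte[frames * sizeof(short)];
                for (int i = 0; i < frames; i++)
                {
                    short sample = (short)(Mathf.Clamp(buffer[i * numChannels], -1f, 1f) * short.MaxValue);
                    pcm[2 * i] = (byte)(sample & 0xFF);
                    pcm[2 * i + 1] = (byte)((sample >> 8) & 0xFF);
                }
                pushStream.Write(pcm);

                // Clear the buffer if the mic audio should not also play on the device speakers.
                Array.Clear(buffer, 0, buffer.Length);
            }

            private void OnDestroy()
            {
                MicStream.MicStopStream();
                MicStream.MicDestroy();
                pushStream?.Close();
                recognizer?.Dispose();
            }
        }

    Note that OnAudioFilterRead runs on Unity's audio thread, so keep the per-frame work there minimal; the push stream's Write only buffers the data for the recognizer. If recognition quality is poor, downmixing/resampling to 16 kHz mono (the Speech SDK's default streaming format) before writing to the push stream is a reasonable next step.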