.net, windows, winapi, uwp

Windows Speech recognition APIs


I have noticed that in Windows 10 and 11 you can press Win+H to start a "Voice Typing" feature. You speak into a microphone and the widget sends keystrokes to whatever window has the focus. The recognition works surprisingly well, in multiple languages (e.g. Italian), and locally: no internet connection is required (I verified this by disconnecting the PC).

I was wondering if there is a way to get access to the same speech recognition engine.

An internet search suggests that Microsoft offers Windows developers several different engines:

  • In classic .NET applications there is the one found under the System.Speech.Recognition namespace, but this doesn't work in Italian [1].
  • There is the "Microsoft Speech Platform" (Microsoft.Speech.Recognition), which is similar to System.Speech.Recognition but intended for server apps. I don't have that installed here [2].
  • In UWP applications there is Windows.Media.SpeechRecognition (this only works online).
  • There is also the "Speech SDK" (Microsoft.CognitiveServices.Speech) which looks like a wrapper around a REST API to the Azure Cognitive Services (online).

The question is: what is "voice typing" using, and can I get access to that?


  1. Italian doesn't work, verified via this code:

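    using System.Diagnostics;
    using System.Speech.Recognition;
    // Print the culture of every recognizer installed for System.Speech.
    foreach (RecognizerInfo info in SpeechRecognitionEngine.InstalledRecognizers())
    {
        Debug.WriteLine(info.Culture);
    }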
    

    This prints only en-US, even though other language packs are installed and the "voice typing" feature works in Italian on the same machine and under the same user.

  2. I don't have the speech server runtime installed, verified via the absence of a \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech Server\ registry entry. So I don't think the Windows widget is using this.
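
A comparable check for the Windows.Media.SpeechRecognition engine from the list above might look like the sketch below. It assumes a project that can call WinRT APIs (a UWP app, or a .NET app built against a Windows 10 target framework; that setup is an assumption and not part of the original test). It only lists the languages the engine reports as supported and says nothing about whether recognition runs offline.

```csharp
using System;
using Windows.Globalization;
using Windows.Media.SpeechRecognition;

// Languages usable with the predefined dictation / web-search topics.
foreach (Language language in SpeechRecognizer.SupportedTopicLanguages)
{
    Console.WriteLine($"topic:   {language.LanguageTag}");
}

// Languages usable with list and grammar-file constraints.
foreach (Language language in SpeechRecognizer.SupportedGrammarLanguages)
{
    Console.WriteLine($"grammar: {language.LanguageTag}");
}
```

If it-IT appears here but not in the System.Speech list, that would at least suggest the two engines draw on different language packs.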


Solution

  • Thanks to @SimonMourier for his comment.

    It looks like the Voice Typing feature is using a hybrid approach where online services are used when available and an offline model is used when the internet isn't accessible.

    The API used would be the one from "Azure Cognitive Services": https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech

    Recognition normally happens online, but Microsoft also provides an "embedded" model that runs locally: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/embedded-speech?tabs=windows-target%2Cjre&pivots=programming-language-csharp
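
    For reference, a minimal sketch of the online path with the Speech SDK (the Microsoft.CognitiveServices.Speech NuGet package) could look like this. The key and region strings are placeholders, and it-IT is chosen only to match the question. This is not the embedded variant: according to the linked embedded-speech page, that one swaps the SpeechConfig for an EmbeddedSpeechConfig pointing at a locally stored model, which is only available after approval.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;

    class Program
    {
        static async Task Main()
        {
            // Placeholders: use the key and region of your own Speech resource.
            var config = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "YOUR_REGION");
            config.SpeechRecognitionLanguage = "it-IT";

            // Capture audio from the default microphone.
            using var audio = AudioConfig.FromDefaultMicrophoneInput();
            using var recognizer = new SpeechRecognizer(config, audio);

            // Recognize a single utterance, then stop.
            SpeechRecognitionResult result = await recognizer.RecognizeOnceAsync();

            if (result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine(result.Text);
            else
                Console.WriteLine($"Nothing recognized: {result.Reason}");
        }
    }
    ```

    RecognizeOnceAsync returns after the first utterance; for continuous dictation closer to what Win+H does, the SDK also offers StartContinuousRecognitionAsync together with the Recognized event.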

    However, to get access to this model, your use case needs to be approved by Azure (since making it publicly accessible would mean losing some of that sweet, sweet money). You can fill out this form to request access: https://customervoice.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7en2Ais5pxKtso_Pz4b1_xUMFNKUU9RTU1UTkdUMzVYUkxDOFZRMVFGSyQlQCN0PWcu

    Only to then have it rejected within 10 business days.

    When something's too good to be true...