speech-recognition speech-to-text azure-cognitive-services microsoft-speech-api microsoft-speech-platform

Difference among Microsoft Speech products/platforms


Microsoft seems to offer quite a few speech recognition products, and I'd like to know the differences among them.

  • There is the Microsoft Speech API, or SAPI. But somehow the Microsoft Cognitive Services Speech API has the same name.

  • Microsoft Cognitive Services on Azure offers both the Speech service API and the Bing Speech API. I assume that for speech-to-text the two APIs are the same.

  • Then there are System.Speech.Recognition (or Desktop SAPI), Microsoft.Speech.Recognition (or Server SAPI) and Windows.Media.Speech.Recognition. Here and here have some explanations of the differences among the three. My guess is that they are older speech recognition models based on HMMs, i.e. not neural network models, and that all three can be used offline without an internet connection. Is that right?

  • The Azure Speech service and Bing Speech APIs use more advanced speech models, right? But I assume there is no way to use them offline on my local machine, as they all require subscription verification (even though the Bing API seems to have a C# desktop library).

Essentially, I want an offline model that does speech-to-text transcription of my conversation data (5-10 minutes per audio recording), recognises multiple speakers, and outputs timestamps (or timecoded output). I am a bit confused by all the options and would greatly appreciate it if someone could explain them to me. Many thanks!


Solution

  • A difficult question - and part of the reason it is so difficult is that we (Microsoft) seem to present an incoherent story about 'speech' and 'speech APIs'. Although I work for Microsoft, the following is my own view. I will try to give some insight into what is being planned in my team (Cognitive Service Speech - Client SDK), but I can't predict all facets of the not-so-near future.

    Early on, Microsoft recognized that speech is an important medium, so Microsoft has an extensive and long-running history of enabling speech in its products. There are really good speech solutions (with local recognition) available; you listed some of them.

    We are working on unifying this and presenting one place for you to find the state-of-the-art speech solution at Microsoft. This is 'Microsoft Speech Service' (https://learn.microsoft.com/de-de/azure/cognitive-services/speech-service/) - currently in preview.

    On the service side, it will combine our major speech technologies, such as speech-to-text, text-to-speech, intent, and translation (plus future services), under one umbrella. Speech and language models are constantly improved and updated. We are developing a client SDK for this service. Over time (later this year) this SDK will be available on all major operating systems (Windows, Linux, Android, iOS) and will support the major programming languages. We will continue to improve platform and language support for the SDK.

    This combination of online service and client SDK will leave preview later this year.

    We understand the desire to have local recognition capabilities. They will not be available 'out-of-the-box' in our first SDK release (and they are not part of the current preview either). One goal for the SDK is parity (in functionality and API) between platforms and languages, which needs a lot of work. Offline recognition is not part of this right now, and I can't make any prediction here, neither about features nor about a timeline ...

    So from my point of view, the new Speech Services and the SDK are the way forward. The goal is a unified API on all platforms and easy access to all Microsoft Speech Services. It requires a subscription key, and it requires that you are 'connected'. We are working hard to get both (server and client) out of preview status later this year.
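    To make the 'subscription key plus connection' requirement concrete, here is a minimal Python sketch that only builds (and does not send) a speech-to-text HTTP request. It is not part of the original answer. The endpoint path, the `Ocp-Apim-Subscription-Key` header and the `format=detailed` parameter reflect the REST interface as I understand it and should be checked against the current documentation; the function name `build_stt_request` and the placeholder key and region are hypothetical. As far as I know this REST endpoint is intended for short clips, so 5-10 minute recordings would likely need the SDK or another route.

```python
import urllib.request


def build_stt_request(audio_bytes, subscription_key, region, language="en-US"):
    """Build, but do not send, a speech-to-text request (hypothetical helper).

    Assumption: the URL pattern and header names below match the Speech
    service REST interface; verify them against the official docs.
    """
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1"
        f"?language={language}&format=detailed"
    )
    headers = {
        # Every call carries the subscription key, so the service
        # cannot be used without an account and a network connection.
        "Ocp-Apim-Subscription-Key": subscription_key,
        # Assumes 16 kHz PCM audio in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        url, data=audio_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_stt_request(b"RIFF", "YOUR_KEY", "westus")
    print(req.get_method(), req.full_url)
```

    Sending the request (for example with `urllib.request.urlopen(req)`) is where the connection requirement comes in: without a valid key and network access, the call fails, which is why this path cannot serve as an offline recognizer.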

    Hope this helps ...

    Wolfgang