First, let me explain my goal: provide an input .wav file, send it to some kind of speech recognition API, and get back a text file with the transcription. The application I have in mind is very simple. I do not need the output parsed for grammar or punctuation; it can come back as one big, long sentence, and I will treat each transcribed word as an observation in a text file (.tsv or .csv format).
However, the one tricky piece of data I do need (tricky because nearly all of the third-party audio transcription services I've reviewed don't provide it to the user) is the [0.00 - 1.00] confidence score for each word the recognizer guesses at. I would like to store that score in a second column of the same .tsv or .csv file, alongside the transcribed text.
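To make the desired output concrete, a purely illustrative .tsv (the words and scores here are invented) would look something like this:

word	confidence
the	0.94
quick	0.71
fox	0.88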
That's it. That's my goal. It seems my goal is possible: here is a quote from an expert in a related post:
Convert Audio(Wav file) to Text using SAPI?
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
and here is the relevant documentation for .wav transcription confidence scores:
https://msdn.microsoft.com/en-us/library/jj127911.aspx
Everyone makes it sound so simple, but now let me explain the problem and why I'm posting a question. For me, the goal is out of reach because I know next to nothing about C++ or COM. I had assumed SAPI was part of the everyday Windows experience and came with a dedicated, friendly user interface, so I grew increasingly alarmed the more I researched this procedure. However, I still believe that in principle this is a very simple thing, so I'm optimistic.
I know Python and a little JS. I'm aware that Python can interface with other languages, so I'm sure it could talk to SAPI somehow, but since I don't know C++, I don't think that would make me any better off.
So, just to reiterate: despite the skill mismatch, I'm still partial to SAPI because the user-friendly alternatives (Dragon, Nuance, Chrome plug-ins, etc.) don't provide the data granularity I need.
Now let me get to the heart of my question:
It probably goes without saying, but I think you're going to find it difficult to work with SAPI's C++/COM interface if you don't have a strong handle on C++ as a language. Some time ago I wrote a program that does almost exactly what you're describing, just to test the concept. First, a code dump:
#include "dirent.h"
#include <iostream>
#include <string>
#include <sapi.h>
#include <sphelper.h>
int main(int argc, char* argv[]){
DIR *dir;
struct dirent* entry;
struct stat* statbuf;
::CoInitialize(NULL);
if((dir = opendir(".")) != NULL){
while((entry = readdir(dir)) != NULL){
char extCheck[260];
strcpy(extCheck, entry->d_name);
if(strlen(extCheck) > 4 && !strcmp(strlwr(extCheck) + strlen(extCheck)-4, ".wav")){
//printf("%s\n",entry->d_name);
//1. Find the wav files
//2. Check the wavs to make sure they're the correct format
//3. Output any errors to the error log
//4. Produce the text files for the wavs
//5. Cleanup and exit
FILE* fp;
std::string fileName = std::string(entry->d_name,entry->d_name + strlen(entry->d_name)-4);
fileName += ".txt";
fp = fopen(fileName.c_str(), "w+");
HRESULT hr = S_OK;
CComPtr<ISpStream> cpInputStream;
CComPtr<ISpRecognizer> cpRecognizer;
CComPtr<ISpRecoContext> cpRecoContext;
CComPtr<ISpRecoGrammar> cpRecoGrammar;
CSpStreamFormat sInputFormat;
hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
hr = sInputFormat.AssignFormat(SPSF_16kHz16BitStereo);
std::string sInputFileName = entry->d_name;
std::wstring wInputFileName = std::wstring(sInputFileName.begin(), sInputFileName.end());
hr = cpInputStream->BindToFile(wInputFileName.c_str(), SPFM_OPEN_READONLY, &sInputFormat.FormatId(), sInputFormat.WaveFormatExPtr(), SPFEI_ALL_EVENTS);
hr = cpRecognizer->SetInput(cpInputStream, TRUE);
hr = cpRecognizer->CreateRecoContext(&cpRecoContext);
hr = cpRecoContext->CreateGrammar(NULL, &cpRecoGrammar);
hr = cpRecoGrammar->LoadDictation(NULL,SPLO_STATIC);
hr = cpRecoContext->SetNotifyWin32Event();
auto hEvent = cpRecoContext->GetNotifyEventHandle();
hr = cpRecoContext->SetInterest(SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM), SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM));
hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
BOOL fEndStreamReached = FALSE;
unsigned int timeOut = 0;
//WaitForSingleObject(hEvent, INFINITE);
while (!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(INFINITE)){
CSpEvent spEvent;
while (!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext)){
switch (spEvent.eEventId){
case SPEI_RECOGNITION:
{
auto pPhrase = spEvent.RecoResult();
SPPHRASE *phrase = nullptr;// new SPPHRASE();
LPWSTR* text = new LPWSTR(L"");
pPhrase->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, text, NULL);
pPhrase->GetPhrase(&phrase);
if(phrase != NULL && phrase->pElements != NULL) {
std::wstring wRuleName = L"";
if(nullptr != phrase && phrase->Rule.pszName != NULL) {
wRuleName = phrase->Rule.pszName;
}
std::wstring recognizedText = L"";
bool firstWord = true;
for(ULONG i = 0; i < (ULONG)phrase->Rule.ulCountOfElements; ++i) {
if(phrase->pElements[i].pszDisplayText != NULL) {
std::wstring outString = phrase->pElements[i].pszDisplayText;
std::string soutString = std::string(outString.begin(), outString.end());
if(!firstWord){
soutString = " " + soutString;
firstWord = false;
}
soutString = soutString + " ";
fputs(soutString.c_str(),fp);
/*if(recognizedText != L"") {
recognizedText += L" " + outString;
} else {
recognizedText += outString;
}*/
}
}
}
delete[] text;
break;
}
case SPEI_END_SR_STREAM:
{
fEndStreamReached = TRUE;
break;
}
}
// clear any event data/object references
spEvent.Clear();
}
}
hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
hr = cpRecoGrammar->UnloadDictation();
hr = cpInputStream->Close();
fclose(fp);
}
}
closedir(dir);
} else {
perror("Error opening directory");
}
::CoUninitialize();
std::printf("Press any key to continue...");
std::getchar();
return 0;
}
I haven't run this in a long time, and you'll have to get a Windows port of dirent.h for it to work; I was playing around with that library for no other reason than to try it out.
With the code provided, you could start looking at what confidence values get generated at the recognition step. You could also tweak this to run from a batch file if you wanted to.
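For the confidence part specifically, each SPPHRASEELEMENT in the structure returned by GetPhrase carries an ActualConfidence value (the low/normal/high bucket) and an SREngineConfidence float. The float's scale is engine-defined; with the stock desktop recognizer it tends to fall in the 0.0 - 1.0 range, but verify that against what your engine actually reports. Below is a rough, untested sketch of a helper (the name WritePhraseWithConfidence is just something I made up) that you could call from the SPEI_RECOGNITION case right after GetPhrase succeeds, writing one word-confidence pair per line instead of the space-separated text above:

#include <windows.h>
#include <sapi.h>
#include <cstdio>

// Writes one "word<TAB>engine confidence<TAB>confidence bucket" line per element.
void WritePhraseWithConfidence(const SPPHRASE* phrase, FILE* fp)
{
    if (phrase == NULL || phrase->pElements == NULL) {
        return;
    }
    for (ULONG i = 0; i < phrase->Rule.ulCountOfElements; ++i) {
        const SPPHRASEELEMENT& el = phrase->pElements[i];
        if (el.pszDisplayText != NULL) {
            // %ls prints the wide display text; SREngineConfidence is the raw
            // engine score, ActualConfidence the -1/0/+1 low/normal/high bucket.
            fprintf(fp, "%ls\t%f\t%d\n",
                    el.pszDisplayText,
                    el.SREngineConfidence,
                    (int)el.ActualConfidence);
        }
    }
}

If you open the output file with a .tsv extension instead of .txt, that gets you the word-plus-confidence columns described in the question.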
The problems that I faced were the following:
With that said, it's not a trivial undertaking to use the stock Windows desktop speech recognizer. I'd take a look at some of the existing speech APIs out there; if you're not limited to client-side-only applications, you'd do well to look into them.