c++ windows text-to-speech sapi

How to get the elapsed time of Microsoft TTS speech?


I currently use the following function to implement a TTS service.

int tts(LPCWSTR text){
    ::CoInitialize(NULL);         
    CLSID CLSID_SpVoice;
    CLSIDFromProgID(_T("SAPI.SpVoice"), &CLSID_SpVoice);
    ISpVoice *pSpVoice = NULL;
    IEnumSpObjectTokens *pSpEnumTokens = NULL;

    if (FAILED(CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_INPROC_SERVER, IID_ISpVoice, (void**)&pSpVoice))){
        return -1;
    }

    if (SUCCEEDED(SpEnumTokens(SPCAT_VOICES, NULL, NULL, &pSpEnumTokens))){
        ISpObjectToken *pSpToken = NULL;
        SpFindBestToken(SPCAT_VOICES, L"Gender=Male", L"Name=Microsoft Simplified Chinese", &pSpToken);
        pSpVoice->SetVoice(pSpToken);
        pSpVoice->Speak(text, SPF_DEFAULT, NULL);
        pSpEnumTokens->Release();        
    }

    pSpVoice->Release();
    ::CoUninitialize();
    return 0;
}

Is it possible for me to get the elapsed time of each character being spoken? Or is it constant (if the speech rate is set)? The purpose is that I want to show some facial animations to match the speech...


Solution

  • You don't necessarily need the elapsed time; you could use Viseme events to trigger your animations.

    Since you're using C++, use ISpVoice::SetInterest to describe the set of events that you want, and one of the ISpNotifySource methods (depending on what your outer code is doing) to get events delivered to you.

    Microsoft has a detailed walkthrough available in case this sketch isn't helpful.

    Note that the exact set of visemes (and which visemes map to which values) is language-dependent. For US English, the visemes are defined in Microsoft's SAPI documentation (the SPVISEMES enumeration). For Chinese, the mapping from phoneme to viseme isn't publicly defined.

    On the other hand, you can use the viseme events as triggers, and not really care about the actual viseme values.
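As a minimal sketch of the "trigger only" idea, independent of SAPI itself: the hypothetical `MouthCycler` class below (the name and the frame-cycling scheme are my own assumptions, not part of SAPI) advances one mouth frame per viseme event and ignores the viseme value entirely.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of SAPI: cycles through a fixed number of
// mouth frames, advancing one frame per viseme event.
class MouthCycler {
public:
    explicit MouthCycler(std::size_t frameCount) : frameCount_(frameCount) {
        assert(frameCount_ > 0 && "need at least one mouth frame");
    }

    // Call once per SPEI_VISEME event; the viseme value is deliberately ignored.
    std::size_t onVisemeEvent() {
        current_ = (current_ + 1) % frameCount_;
        return current_;
    }

    // Call when speech ends so the mouth returns to its resting frame.
    void reset() { current_ = 0; }

    std::size_t currentFrame() const { return current_; }

private:
    std::size_t frameCount_;
    std::size_t current_ = 0;  // frame 0 is treated as the resting (closed) mouth
};
```

You would call `onVisemeEvent()` from the `SPEI_VISEME` case of your event handler and draw whichever frame it returns. This will not be phonetically accurate, but it keeps the mouth moving in step with the speech.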

    Assuming you have a message loop somewhere in your app, your code would look like this:

    CLSID CLSID_SpVoice;
    CLSIDFromProgID(_T("SAPI.SpVoice"), &CLSID_SpVoice);
    ISpVoice *pSpVoice = NULL;
    IEnumSpObjectTokens *pSpEnumTokens = NULL;
    ULONGLONG  ullMyEvents = SPFEI(SPEI_VISEME);
    
    if (FAILED(CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_INPROC_SERVER, IID_ISpVoice, (void**)&pSpVoice))){
        return -1;
    }
    
    if (SUCCEEDED(SpEnumTokens(SPCAT_VOICES, NULL, NULL, &pSpEnumTokens))){
        ISpObjectToken *pSpToken = NULL;
        if (SUCCEEDED(SpFindBestToken(SPCAT_VOICES, L"Gender=Male", L"Name=Microsoft Simplified Chinese", &pSpToken))){
            pSpVoice->SetVoice(pSpToken);
            pSpToken->Release();
        }
        // Set the type of events the client is interested in.
        pSpVoice->SetInterest(ullMyEvents, ullMyEvents);
        // Deliver a WM_APP message when a SAPI event arrives.
        // Use a different message ID for real code.
        pSpVoice->SetNotifyWindowMessage(hWnd, WM_APP, 0, 0);
        // Speak asynchronously so the message loop keeps running while events arrive.
        pSpVoice->Speak(text, SPF_ASYNC, NULL);
        pSpEnumTokens->Release();
    }

    // Keep pSpVoice alive (for example, as a member or global) until speech
    // has finished, because the message handler still needs it. Release it afterwards.
    

    Later, in your window procedure, you need to handle the SAPI message:

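      case WM_APP:
      {
         // pVoice is the ISpVoice pointer created above, kept alive while speaking.
         // Drain every queued SAPI event; one message may cover several events.
         SPEVENT eventItem;
         memset( &eventItem, 0, sizeof(SPEVENT) );
         while( pVoice->GetEvents(1, &eventItem, NULL ) == S_OK )
         {
           switch( eventItem.eEventId )
           {
              case SPEI_VISEME:
                 .
                 .
                 .
                 break;

              default:
                 break;
           }

           // Free anything attached to the event before reusing the struct.
           SpClearEvent( &eventItem );
         }
         break;
      }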
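On the original timing question: as far as I recall from the SPEVENTENUM documentation, a `SPEI_VISEME` event carries the duration of the current viseme (in milliseconds) in the high word of `wParam`, the next viseme in the low word of `wParam`, and the current viseme in the low word of `lParam`. Treat that layout as an assumption to check against the documentation. The sketch below decodes it with plain integers so that it compiles without the SAPI headers; `VisemeInfo`, `decodeViseme`, and `SpeechClock` are hypothetical names of my own.

```cpp
#include <cassert>
#include <cstdint>

// Decoded fields of one viseme event (layout assumed as described above).
struct VisemeInfo {
    std::uint16_t currentViseme;  // low word of lParam
    std::uint16_t feature;        // high word of lParam
    std::uint16_t nextViseme;     // low word of wParam
    std::uint16_t durationMs;     // high word of wParam
};

inline std::uint16_t lowWord(std::uint64_t v) {
    return static_cast<std::uint16_t>(v & 0xFFFFu);
}

inline std::uint16_t highWord(std::uint64_t v) {
    return static_cast<std::uint16_t>((v >> 16) & 0xFFFFu);
}

// In the real handler you would pass eventItem.wParam and eventItem.lParam.
inline VisemeInfo decodeViseme(std::uint64_t wParam, std::uint64_t lParam) {
    return VisemeInfo{ lowWord(lParam), highWord(lParam),
                       lowWord(wParam), highWord(wParam) };
}

// Running total of viseme durations, as a rough estimate of elapsed speech time.
class SpeechClock {
public:
    void advance(const VisemeInfo& v) { elapsedMs_ += v.durationMs; }
    std::uint64_t elapsedMs() const { return elapsedMs_; }
private:
    std::uint64_t elapsedMs_ = 0;
};

// Usage with made-up values: an 80 ms viseme 12, followed by viseme 5.
inline void exampleUsage() {
    SpeechClock clock;
    VisemeInfo v = decodeViseme((80u << 16) | 5u, (0u << 16) | 12u);
    clock.advance(v);
    assert(v.durationMs == 80 && v.currentViseme == 12 && v.nextViseme == 5);
    assert(clock.elapsedMs() == 80);
}
```

Summing the durations gives only an approximation of elapsed time. The `SPEVENT` structure also has a `ullAudioStreamOffset` field, which may be a better basis if you need to line animations up with the audio stream itself.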