Tags: c#, botframework, skype, chatbot, azure-cognitive-services

How can I use a Skype audio attachment with the Bing Speech API when using the Bot Framework?


I have a bot built with the Microsoft Bot Framework that uses Skype as a channel. When the user speaks to the bot by sending an audio message from one of the mobile apps (Android or iOS), I want to get the audio from the attachments and send it to the Bing Speech API to convert it to text.

I'm having some issues doing this. I believe the main problem is that I have to send a WAV file to the Bing Speech API. I read the demo in the Bot Builder repository, which contains the following code:

var audioAttachment = activity.Attachments?.FirstOrDefault(a => a.ContentType.Equals("audio/wav"));
if (audioAttachment != null)
{
    using (var client = new HttpClient())
    {
        var stream = await client.GetStreamAsync(audioAttachment.ContentUrl);
        var text = await this.speechService.GetTextFromAudioAsync(stream);
        message = ProcessText(activity.Text, text);
    }
}

However, when I send an audio message through the Skype mobile app (I'm testing with Android), the attachment's ContentType is not "audio/wav"; it comes through as just "audio".

When I try to get the audio file in the Bot State Manager API using Postman (the URL looks like this: https://smba.trafficmanager.net/apis/v3/attachments/0-eus-d1-0000000000000/views/original) I get something with the content type of "application/octet-stream", and I don't know if this is an MP3, or WAV, or whatever.

The first few lines I can see in Postman look something like this:

ftypmp42isommp42pmoovlmvhd�_ ��_ ���@ymeta!hdlrmdta+keysmdtacom.android.version%ilstdata7.1.1�trak\tkhd�_ ��_ ��@mdia mdhd�_ ��_ ��D��,hdlrsounSoundHandle�minfsmhd$dinfdref url �stbl[stsdKmp4a�D'esds@ww0stts��-�stsz

I download this content to a Stream using the ReadAsStreamAsync method and pass this stream to the Bing Speech API, at the following endpoint:

https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=pt-BR&format=detailed

However this is what I get back:

{"RecognitionStatus":"InitialSilenceTimeout","Offset":11000000,"Duration":0}

In this case the audio contains audible speech, yet the API doesn't detect it. As I said, I believe the problem is the file type. What file type does Skype use, and how can I use this file to call the Bing Speech API?


Solution

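  • What is the file type used by Skype, and how can I use this file to call the Bing Speech API?

    You're right, the problem is the file type. The Bing Speech API currently supports only WAV/PCM audio, so if your audio file is in any other format, you'll need to convert it to PCM first. The "ftypmp42" and "mp4a" markers at the start of your download suggest an MP4 container, most likely carrying AAC audio, which is probably why the service finds no speech in it.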
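
    If you're unsure what the attachment really contains, one option is to look at the first bytes of the download rather than the "application/octet-stream" content type. The sketch below is only a heuristic: the AudioSniffer and AudioContainer names are made up for illustration, and it relies on the well-known signatures "RIFF"/"WAVE" for WAV, "ftyp" at offset 4 for MP4 (the same marker visible in your Postman dump), and "ID3" for tagged MP3 files.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical names, introduced only for this illustration.
    public enum AudioContainer { Unknown, Wav, Mp4, Mp3 }

    public static class AudioSniffer
    {
        // Guess the container format from the first 12 bytes of the payload.
        public static AudioContainer Detect(byte[] header)
        {
            if (header == null || header.Length < 12)
                return AudioContainer.Unknown;

            string first4 = Encoding.ASCII.GetString(header, 0, 4);
            string at4 = Encoding.ASCII.GetString(header, 4, 4);
            string at8 = Encoding.ASCII.GetString(header, 8, 4);

            // WAV: "RIFF" <4-byte size> "WAVE"
            if (first4 == "RIFF" && at8 == "WAVE")
                return AudioContainer.Wav;

            // MP4/M4A: <4-byte box size> "ftyp" <brand>
            if (at4 == "ftyp")
                return AudioContainer.Mp4;

            // MP3 with an ID3v2 tag; an untagged MP3 is reported as Unknown here.
            if (first4.StartsWith("ID3", StringComparison.Ordinal))
                return AudioContainer.Mp3;

            return AudioContainer.Unknown;
        }
    }
    ```

    You would read the first 12 bytes of the downloaded attachment into a byte array and pass it to AudioSniffer.Detect; anything other than Wav needs converting before it goes to the Speech API.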

    To detect whether the user's attachment is an audio file, you can, for example, modify your code like this:

    var audioAttachment = activity.Attachments?.FirstOrDefault(a => a.ContentType.Contains("audio"));
    

    The real problem then is converting it to a .wav file. In C#, you could try the NAudio package.
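
    As a rough idea of what the conversion might look like with NAudio, here is a minimal sketch. It rests on several assumptions: that the Skype attachment is MP4/AAC (as the header in the question suggests), that the code runs on Windows with Media Foundation available (which NAudio's MediaFoundationReader needs, and which may be missing on some server installations), and that 16 kHz, 16-bit mono PCM is a suitable target for the Speech API. SkypeAudioConverter and ConvertToPcmWav are hypothetical names, not part of any library.

    ```csharp
    using System;
    using System.IO;
    using NAudio.Wave;

    // Hypothetical helper, not part of NAudio or the Bot Framework.
    public static class SkypeAudioConverter
    {
        // Decodes the downloaded attachment and returns it as a PCM WAV stream.
        public static MemoryStream ConvertToPcmWav(Stream source)
        {
            // Write to a temp file first; the ".mp4" extension is an assumption
            // based on the "ftypmp42" header seen in the question.
            string tempPath = Path.Combine(
                Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".mp4");
            try
            {
                using (var file = File.Create(tempPath))
                {
                    source.CopyTo(file);
                }

                // Assumed target format: 16 kHz, 16-bit, mono PCM.
                var targetFormat = new WaveFormat(16000, 16, 1);
                var output = new MemoryStream();

                using (var reader = new MediaFoundationReader(tempPath))
                using (var resampler = new MediaFoundationResampler(reader, targetFormat))
                {
                    // Writes a complete WAV (header + samples) into the stream.
                    WaveFileWriter.WriteWavFileToStream(output, resampler);
                }

                output.Position = 0; // rewind so the caller can read from the start
                return output;
            }
            finally
            {
                File.Delete(tempPath);
            }
        }
    }
    ```

    In the demo code from the question, you would then pass the result of SkypeAudioConverter.ConvertToPcmWav(stream) to GetTextFromAudioAsync instead of the raw stream.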