Tags: dialogflow-es, actions-on-google

Is dialog with background audio possible?


One of my goals in learning Dialogflow and programmatic webhooks to handle intents and responses is to see whether it's possible to create an audio-rich quiz.

In this quiz there's a section where I would like to have multiple questions/responses with a continuous audio background playing at the same time. Is this possible or must the Google Assistant or Google Home Speaker be silent when waiting for user input?

Also, if the above is possible, is it also possible to make crossfades between audio backgrounds triggered by events of some kind? That is, something like a change of scenery without any interruption or silence.


Solution

  • There are a number of approaches to this, depending on your exact needs and the limitations you have. Not everything is completely possible, but you may be able to get close.

    Dialog with Background Audio

    The easiest way to do this would be to use Google's version of SSML, which supports parallel playback using the <par> and <media> tags. (Be aware that these are non-standard tags if you want to be able to use your SSML elsewhere.) With this, you would have one "track" for the dialog and one for the background audio. It might look something like this:

    <speak><par>
        <media xml:id="track-0" begin="0s" soundLevel="+0dB">
            <audio src="https://actions.google.com/sounds/v1/crowds/crowd_talking.ogg" >crowd talking</audio>
        </media>
        <media xml:id="track-1" begin="0.75s" soundLevel="+0dB">
            <seq>
            <media>
                <speak><p>Well, hello there</p></speak>
            </media>
            <media begin="2.0s">
                <speak><p>How are you?</p></speak>
            </media>
            </seq>
        </media>
    </par></speak>
    

    Is there an easy way to design this?

    You might want to check out the Nightingale Visual SSML Editor, which has also been released as an open source project. It can help get you started, though you will likely want to tweak the generated SSML yourself.

    What about cross-fades?

    Sure! Just indicate that you're fading one track out and fading another in, starting at an offset relative to the end of the first track.

    <speak><par>
        <media xml:id="track-0" begin="0s" fadeOutDur="6s">
            <audio src="https://actions.google.com/sounds/v1/alarms/digital_watch_alarm_long.ogg" >digital watch alarm long</audio>
        </media>
        <media xml:id="track-1" begin="track-0.end-6s" fadeInDur="6s">
            <audio src="https://actions.google.com/sounds/v1/human_sounds/baby_cry_long.ogg" >baby cry long</audio>
        </media>
    </par></speak>
    

    And I can have this playing while the microphone is open and listening to a user?

    Well... no. Not with SSML.

    If you think about this as a conversation with someone, we typically rely on audio cues to know when to speak. Asking a question and then leaving quiet space for the reply is an excellent way to do this. In person, we have other cues (we can see the other person pause, for example), but if all we have is audio, we only have silence.

    The Assistant works on this sort of conversational model, so it wants to make it clear to the user when it is their turn to speak.

    You can argue that for some kinds of "conversations", it is normal to have a different audio cue that prompts the user. And you'd be correct. But the Assistant needs to be a general-purpose Assistant.

    Then how can I do this?

    Since you're making a quiz, you can use the Interactive Canvas to create a page with HTML and JavaScript. There, you can use an <audio> tag or the MediaStream JavaScript API to play the background audio. The microphone can be open while this media is still playing. One downside: there isn't any event that tells you when the mic opens, so you can't duck the audio at that moment.

    This gets around the audio-cue issues because there are visual cues when the microphone is open.
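    As a minimal sketch of that approach: the snippet below loops a background track in a Canvas web page via an <audio> element. The file name `background.ogg` and the starting volume are placeholder assumptions, not anything prescribed by the platform; the only platform constraint reflected here is that browsers require a user gesture before audio playback can begin.

    ```javascript
    // Pure helper: clamp a requested volume into the [0, 1] range that
    // the HTMLMediaElement `volume` property accepts.
    function clampVolume(v) {
      return Math.min(1, Math.max(0, v));
    }

    // Browser-only portion, guarded so the helper above can also run
    // (and be tested) outside a browser.
    if (typeof window !== 'undefined') {
      const bg = new Audio('background.ogg'); // placeholder asset URL
      bg.loop = true;                  // keep playing across quiz turns
      bg.volume = clampVolume(0.4);    // quieter than the TTS voice
      // Browsers block autoplay; start on the first user gesture.
      window.addEventListener('click', () => bg.play(), { once: true });
    }
    ```

    Keeping the volume logic in a plain function makes it easy to later adjust (for example, to lower it while a question is being read out).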

    You can also use the SSML <mark> tag to trigger events in your JavaScript so you know when the SSML from the server begins, ends, or hits other points in the audio stream.
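    A sketch of wiring that up, assuming the Interactive Canvas host library is loaded on the page: the Canvas callbacks report mark names as strings, and the mark name "duck" below is a made-up example you would place in your own SSML (e.g. `<speak><mark name="duck"/>Question two...</speak>`). The decision logic is kept in a plain function so it can run anywhere.

    ```javascript
    // Pure helper: decide a background-audio action for a given SSML
    // mark name. "START" and "END" bracket the TTS audio; "duck" is a
    // hypothetical custom mark added in the webhook's SSML response.
    function actionForMark(markName) {
      switch (markName) {
        case 'START': return 'lower-volume'; // TTS began speaking
        case 'END':   return 'raise-volume'; // TTS finished
        case 'duck':  return 'lower-volume';
        default:      return 'no-op';
      }
    }

    // Browser-only wiring, assuming interactiveCanvas is available.
    if (typeof window !== 'undefined' && window.interactiveCanvas) {
      interactiveCanvas.ready({
        onUpdate(data) { /* handle state sent from the webhook */ },
        onTtsMark(markName) {
          console.log('mark:', markName, '->', actionForMark(markName));
        },
      });
    }
    ```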

    If there are visual cues, can I use the Interactive Canvas on a smart speaker?

    Well... no.

    But what you can do is use runtime surface capabilities to determine whether the Interactive Canvas is supported.

    • If it is - use it (possibly along with SSML).
    • If not, use the SSML method.
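    A sketch of that branch in a webhook: the capability string below is the one used by the actions-on-google Node.js library, while the intent name and Canvas URL in the comment are placeholders. The selection logic is a plain function over a list of capability names so it can be exercised anywhere.

    ```javascript
    // Capability name used by the actions-on-google Node.js library.
    const CANVAS = 'actions.capability.INTERACTIVE_CANVAS';

    // Pure helper: pick a response strategy from the device's
    // reported capability names.
    function chooseAudioStrategy(capabilities) {
      return capabilities.includes(CANVAS) ? 'canvas' : 'ssml';
    }

    // In an actions-on-google webhook it might be wired up roughly
    // like this (untested sketch; URL and intent name are made up):
    //
    // app.intent('start quiz', conv => {
    //   if (conv.surface.capabilities.has(CANVAS)) {
    //     conv.ask(new HtmlResponse({ url: 'https://example.com/quiz' }));
    //   } else {
    //     conv.ask('<speak><par>...background audio SSML...</par></speak>');
    //   }
    // });
    ```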