dialogflow-es, actions-on-google, google-assistant-sdk, google-home

Having several GoogleResponses in a row without user input or interaction


I am working on a cooking recipe app for Google Home, and I need a way to string several GoogleResponses (SimpleResponse, etc.) together without requiring user interaction between them.

I have searched for other answers pertaining to this, and while I have found a few questions similar to mine, the replies tend to be along the lines of "the system was designed for dialogues, so what would be the point?".

I fully understand this point of view; however, because of the nature and behaviour requirements of the app I am developing, I find myself in need of this particular capability.

The recipes are divided into steps (revolutionary, I know...) and there is roughly a one-to-one correspondence between steps and GoogleResponses.

To give an example, a typical recipe usually unfolds like this (a simplification, of course):

main content -> question -> main content -> question -> etc.

With each instance of "main content" being a step of the recipe and each "question" requiring user input.

If it were always like this, there would not be a problem: I could just bundle each "main content -> question" section into one GoogleResponse and be done (sketched below).
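
To illustrate, the bundling I have in mind looks roughly like this with the Node.js actions-on-google client library; the intent name 'recipe step' and the step text are just placeholders of mine:

    // Fulfillment sketch using the actions-on-google Dialogflow library (v2).
    // The intent name 'recipe step' and the recipe text are placeholders.
    import { dialogflow, SimpleResponse } from 'actions-on-google';

    const app = dialogflow();

    app.intent('recipe step', (conv) => {
      // One bundled "main content -> question" turn.
      conv.ask(new SimpleResponse({
        speech: 'Step 3: Simmer the sauce for ten minutes.',
        text: 'Step 3: Simmer the sauce for 10 minutes.',
      }));
      conv.ask('Shall I continue to the next step?');
    });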

However, there are often times when the recipe flows more like:

main content -> main content -> main content -> question

With each "main content" again being a step of the recipe, it does not make sense in this context to bundle them together into the same response (there is a system that lets the user move back and forth between steps).

I was originally using MediaResponses for the "main content" sections, as those do not require user input to move on to the next step. However, for various reasons that I won't go into here (this is already getting quite long), the project manager has decided that MediaResponses should not be used in this project.


Solution

  • The short answer is the one you already encountered - trying to make conversational actions not-so-conversational doesn't work very well. However, there are a few things you can look into.

    Recipe Structured Data

    Since you're working specifically on a recipe action, it may be worthwhile to use the standard recipe support that comes with the Assistant.

    On the upside, people will be familiar with it, and you don't need to write much code; you just provide markup on a webpage (sketched below).

    On the downside, if you have other requirements for how you want the interaction to go, it isn't that flexible. (For example, if you're asking questions at certain points in the recipe, or if you want to offer measurement adjustments based on the number of people to serve.)
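
    As a rough sketch, the markup is schema.org/Recipe structured data. It's shown here as a TypeScript object for consistency with the other examples; on a real page, the same JSON sits inside a <script type="application/ld+json"> tag. The recipe itself is invented:

        // schema.org/Recipe structured data for an invented recipe, shown as
        // a TypeScript object; on a real page this JSON goes inside a
        // <script type="application/ld+json"> tag.
        const recipeStructuredData = {
          '@context': 'https://schema.org/',
          '@type': 'Recipe',
          name: 'Example Tomato Soup',
          recipeYield: '4 servings',
          recipeIngredient: ['6 ripe tomatoes', '1 onion, chopped'],
          recipeInstructions: [
            { '@type': 'HowToStep', text: 'Saute the onion until soft.' },
            { '@type': 'HowToStep', text: 'Add the tomatoes and simmer for 20 minutes.' },
          ],
        };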

    Misuse the "No Input" event

    You can configure dynamic reprompts so you get an event if the user doesn't say anything for a few seconds. If they want to speed up a reply, they can ask for the next step explicitly, or you can catch the actions_intent_NO_INPUT event in Dialogflow and advance the recipe yourself (see the sketch after this list).

    There are a few downsides here:

    • Not all devices support no-input events; mobile devices, for example, won't generate them.
    • This may only be valid for two no-input events in a row. On the third event, the Assistant may automatically close the conversation. (The documentation is unclear on this, and the exact behavior has changed over time.)
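
    With those caveats, a minimal sketch of the no-input approach, assuming a Dialogflow intent named 'no input' that is attached to the actions_intent_NO_INPUT event (the step list is a stand-in for real recipe data):

        import { dialogflow } from 'actions-on-google';

        interface ConvData { stepIndex?: number; }

        const app = dialogflow<ConvData, {}>();

        // Stand-in step list; real data would come from the recipe.
        const steps = ['Chop the onions.', 'Saute until golden.', 'Add the stock.'];

        // 'no input' is a Dialogflow intent attached to the
        // actions_intent_NO_INPUT event.
        app.intent('no input', (conv) => {
          // IS_FINAL_REPROMPT marks the last no-input event before the
          // Assistant closes the conversation on its own.
          if (conv.arguments.get('IS_FINAL_REPROMPT')) {
            conv.close('Happy cooking!');
            return;
          }
          // Treat silence as "continue" and advance to the next step.
          const index = conv.data.stepIndex || 0;
          conv.data.stepIndex = index + 1;
          conv.ask(steps[Math.min(index, steps.length - 1)]);
        });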

    Media Response

    You don't say why the Media Response "shouldn't be used", but it is one of the only ways to trigger an event when speaking is completed (a sketch follows the list below).

    There are several downsides, however:

    • There are a number of bugs with Media Response around quitting
    • On devices with screens, there is a media player. Since the media itself is incidental to what you're doing, having the player doesn't make sense
    • It isn't supported on all surfaces
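
    For completeness, the pattern looks roughly like this: play a short, effectively silent audio clip after each step (the URL is a placeholder) and catch the actions_intent_MEDIA_STATUS event when it finishes:

        import { dialogflow, MediaObject, Suggestions } from 'actions-on-google';

        const app = dialogflow();

        app.intent('read step', (conv) => {
          conv.ask('Step 2: Whisk the eggs.');
          // When this clip finishes playing, the Assistant fires
          // actions_intent_MEDIA_STATUS. The URL is a placeholder.
          conv.ask(new MediaObject({
            name: 'Next step',
            url: 'https://example.com/silence.mp3',
          }));
          // Media responses need suggestion chips if the conversation continues.
          conv.ask(new Suggestions(['Next step', 'Repeat']));
        });

        // 'media status' is a Dialogflow intent attached to the
        // actions_intent_MEDIA_STATUS event.
        app.intent('media status', (conv) => {
          const status = conv.arguments.get('MEDIA_STATUS');
          if (status && status.status === 'FINISHED') {
            conv.ask('Step 3: Fold in the flour.'); // advance automatically
          }
        });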

    Interactive Canvas

    A similar approach, however, would be to use the Interactive Canvas. This gives you an HTML page with JavaScript that you control, including the ability to send responses to the server as if the user had spoken them (or as if they had touched a suggestion chip). You can also listen for an event that fires when the generated speech has finished (see the sketch after this list).

    There are, however, a number of downsides which probably prevent you from using this right now:

    • The biggest is that the Interactive Canvas can only be used for games right now. (But this seems to be a policy decision, rather than a technical one. So perhaps it will be lifted in the future.)
    • It does not work on smart speakers - only some devices with screens.
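
    The Canvas-side half is roughly this: register an onTtsMark callback in the web app and, when the 'END' mark signals that speech has finished, send a text query back as if the user had asked for the next step. (On the fulfillment side, you would return an HtmlResponse pointing at this page.)

        // Runs inside the Interactive Canvas web app (browser side).
        // interactiveCanvas is the global injected by the Canvas environment.
        declare const interactiveCanvas: any;

        interactiveCanvas.ready({
          onUpdate(state: any) {
            // Render the current recipe step sent by the fulfillment.
          },
          onTtsMark(markName: string) {
            // The 'END' mark fires when the generated speech has finished.
            if (markName === 'END') {
              // Advance as if the user had said "next step".
              interactiveCanvas.sendTextQuery('next step');
            }
          },
        });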

    Combining the above approaches

    One way to get around the device limitations of the Interactive Canvas and the poor visuals that accompany the Media Response might be to mix the two, as sketched below: for devices that support the Canvas, use that; if not, try the Media Response. (You may even wish to consider the no-input reprompt on some platforms.)

    But this still won't work on all devices, and still has the limitation that Interactive Canvas is only for games right now.
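
    A sketch of that surface check, using the capability names from the actions-on-google library (URLs are placeholders):

        import { dialogflow, HtmlResponse, MediaObject, Suggestions } from 'actions-on-google';

        const app = dialogflow();

        app.intent('read step', (conv) => {
          conv.ask('Step 2: Whisk the eggs.');
          if (conv.surface.capabilities.has('actions.capability.INTERACTIVE_CANVAS')) {
            // The Canvas page (placeholder URL) detects when speech finishes.
            conv.ask(new HtmlResponse({ url: 'https://example.com/canvas/' }));
          } else if (conv.surface.capabilities.has('actions.capability.MEDIA_RESPONSE_AUDIO')) {
            // Fall back to a near-silent media response for a completion event.
            conv.ask(new MediaObject({ name: 'Next step', url: 'https://example.com/silence.mp3' }));
            conv.ask(new Suggestions(['Next step']));
          }
          // Otherwise fall back to a question or the no-input reprompt.
        });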

    Summary

    There is no single, clear way to handle this, and it isn't a feature they are likely to add, given the conversational nature of the platform. However, some of the workarounds above may work for your scenario.