Actions on Google webhook call - add a delay to speech response

I'm trying to build my own Google Assistant Action. My webhook sends a JSON response back to Google Assistant, which receives it and reads the text aloud. So far, so good.

Now I want the following: Google Assistant should read "Test 123", pause for 1 second, and then read "Test 321".

How should I adjust my JSON response to get this delay? Is it possible?

I generate the JSON response object in ASP.NET.

My main class:

[HttpPost]
public async Task<IActionResult> PostWebHook()
{
    // Read the raw request body sent by Google Assistant.
    string body;
    using (var reader = new StreamReader(Request.Body))
    {
        body = await reader.ReadToEndAsync();
    }

    // Request object
    var request = JsonConvert.DeserializeObject<Google_Assistant_Request_Json.RequestJson>(body);
    // Response object
    var response = new Google_Assistant_Response_Json.ResponseJson();

    response.session.id = request.session.id;
    response.prompt.@override = false;
    response.prompt.firstSimple.speech = "Test123";
    response.prompt.lastSimple.speech = "Test321";

    return Ok(response);
}

Solution

The easiest way is to use SSML as your speech response and to include a <break> tag to generate the pause.

So your SSML might look something like:

<speak>
Test 123 <break time="1s"/> Test 321
</speak>

Note that this is the value for the "speech" field in your response. The "text" field should not include the SSML markup. By default there is no way to make the second part of the text appear after a 1 second delay: all of the text is shown at once. (There are some advanced techniques that involve the Interactive Canvas and the SSML <mark> tag, but these have some restrictions on their use.)
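As a rough sketch, the serialized JSON your webhook returns might then look something like the following. This assumes your response classes serialize to the same field names used in your code (session, prompt, override, firstSimple). The session id shown is a placeholder, not a real value; you would copy it from the incoming request.

```json
{
  "session": {
    "id": "PLACEHOLDER_SESSION_ID_FROM_REQUEST"
  },
  "prompt": {
    "override": false,
    "firstSimple": {
      "speech": "<speak>Test 123 <break time='1s'/> Test 321</speak>",
      "text": "Test 123 Test 321"
    }
  }
}
```

Using single quotes around the time attribute avoids having to escape double quotes inside the JSON string.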

Based on your code, it might look something like this:

response.prompt.firstSimple.speech = "<speak>Test123 <break time='1s'/> Test321</speak>";
response.prompt.firstSimple.text   = "Test123 Test321";

or, if you have a good reason to use both firstSimple and lastSimple:

response.prompt.firstSimple.speech = "<speak>Test123 <break time='1s'/></speak>";
response.prompt.firstSimple.text   = "Test123";
response.prompt.lastSimple.speech  = "<speak>Test321</speak>";
response.prompt.lastSimple.text    = "Test321";