Tags: python, google-cloud-platform, google-app-engine, fastapi, google-cloud-vertex-ai

FastAPI StreamingResponse does not stream the text response; it arrives in one shot instead [on GAE platform]


I want to stream a Vertex AI response. For that, I have prepared the following function, which presumably yields the response in chunks:

import vertexai
from vertexai.language_models import TextGenerationModel


def prompt_ai(prompt):
    """Yield the Vertex AI response for `prompt` one text chunk at a time."""
    vertexai.init(project="XXX-YYYY", location="ZZ-PPPP")
    parameters = {
        "max_output_tokens": 1024,
        "temperature": 0.2,
        "top_p": 0.8,
        "top_k": 40,
    }
    model = TextGenerationModel.from_pretrained("text-bison")
    responses = model.predict_streaming(prompt, **parameters)
    for response in responses:
        # Convert each streamed response object to a string before yielding it.
        yield str(response)

And this FastAPI endpoint uses it:

from fastapi.responses import StreamingResponse

async def search(ai_prompt: str):
    return StreamingResponse(prompt_ai(ai_prompt), media_type="text/event-stream")

Both are deployed on Google App Engine.

But when I call it with the following Python script (on my PC):

import requests

url = "https://myGCPdomain.appspot.com/search"
params = {
    "ai_prompt": "Tell me something funny",
}

headers = {
    "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6Ietc..."
}

response = requests.post(url, params=params, headers=headers, stream=True)

for chunk in response.iter_lines():
    if chunk:
        print(chunk.decode("utf-8"))

It should presumably stream the text response as it comes from Vertex AI; instead, I am getting it in one shot.
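
One way to tell incremental delivery from a single buffered delivery is to timestamp each chunk as it arrives. The sketch below is a minimal illustration, not part of the original code: `timed_chunks` and `fake_stream` are hypothetical helper names, and the stand-in stream only simulates a slow response body. Note also that `iter_lines()` emits data only when a line ending arrives, so chunks without newlines would be held back by the client regardless of what the server does; the commented usage therefore assumes `iter_content(chunk_size=None)` instead.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_chunks(chunks: Iterable[bytes]) -> Iterator[Tuple[float, bytes]]:
    """Yield (seconds_since_start, chunk) for each non-empty chunk received."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:  # skip empty chunks
            yield time.monotonic() - start, chunk


def fake_stream(parts, delay=0.05):
    """Hypothetical stand-in for a response body that arrives piece by piece."""
    for part in parts:
        time.sleep(delay)
        yield part


# Local check against the stand-in stream.
arrivals = list(timed_chunks(fake_stream([b"Hello, ", b"world"])))
for elapsed, chunk in arrivals:
    print(f"{elapsed:6.2f}s  {chunk.decode('utf-8')}")

# Against the real endpoint, the same helper could wrap the requests response:
#   for elapsed, chunk in timed_chunks(response.iter_content(chunk_size=None)):
#       print(f"{elapsed:6.2f}s  {chunk.decode('utf-8', errors='replace')}")
```

If every chunk shows roughly the same timestamp near the end of the request, the body was delivered in one piece; clearly spaced timestamps indicate real streaming.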

What am I missing here? I appreciate your help.

Note: This isn't a duplicate. The issue is specific to the Google App Engine platform.


Solution

  • According to the Google App Engine documentation:

    App Engine does not support streaming responses where data is sent in incremental chunks to the client while a request is being processed. All data from your code is collected as described above and sent as a single HTTP response.
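
The effect described in the quoted documentation can be pictured with a toy model. The sketch below is an illustration under stated assumptions, not App Engine's actual implementation: `model_chunks` is a hypothetical stand-in for the chunks yielded by `prompt_ai`, and `buffering_frontend` simply imitates a layer that collects the whole body before sending it on.

```python
from typing import Iterable, Iterator


def model_chunks() -> Iterator[str]:
    """Hypothetical stand-in for the chunks yielded by prompt_ai()."""
    yield from ["Why did ", "the chicken ", "cross the road?"]


def buffering_frontend(body: Iterable[str]) -> Iterator[str]:
    """Toy model of a layer that collects the whole body before sending it."""
    yield "".join(body)  # drains the generator completely, then emits once


streamed = list(model_chunks())
buffered = list(buffering_frontend(model_chunks()))

print(len(streamed), "pieces when passed through unchanged")
print(len(buffered), "piece after buffering:", buffered[0])
```

In this model the generator still yields chunk by chunk, but the client sees only one piece, which matches the behaviour reported in the question even though the application code itself yields incrementally.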