I want to stream a Vertex AI response. For that I have prepared the following function, which should yield the response in chunks:
import vertexai
from vertexai.language_models import TextGenerationModel

def prompt_ai(prompt):
    vertexai.init(project="XXX-YYYY", location="ZZ-PPPP")
    parameters = {
        "max_output_tokens": 1024,
        "temperature": 0.2,
        "top_p": 0.8,
        "top_k": 40,
    }
    model = TextGenerationModel.from_pretrained("text-bison")
    # predict_streaming returns an iterator of partial responses
    responses = model.predict_streaming(prompt, **parameters)
    for response in responses:
        text_chunk = str(response)
        yield text_chunk
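As a quick local sanity check (just a sketch, assuming Google Cloud credentials are already configured on the machine), I would expect to be able to consume the generator directly and see each chunk printed as it is yielded:

# Hypothetical local check: iterate the generator and print each chunk as it arrives.
for chunk in prompt_ai("Tell me something funny"):
    print(chunk, flush=True)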
And this FastAPI endpoint which uses it:
from fastapi.responses import StreamingResponse

@app.post("/search")
async def search(ai_prompt: str):
    return StreamingResponse(prompt_ai(ai_prompt), media_type='text/event-stream')
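Since the media type is text/event-stream, my understanding is that each chunk would normally be framed as an SSE "data:" event. A sketch of that framing (an assumption on my part, not what I currently have deployed):

# Hypothetical SSE wrapper: frame each chunk as a "data:" event terminated by a blank line,
# which is the format text/event-stream clients expect.
def sse_prompt_ai(prompt):
    for chunk in prompt_ai(prompt):
        yield f"data: {chunk}\n\n"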
Both are deployed on Google App Engine.
But when I call it via the following Python script (on my PC):
import requests

url = "https://myGCPdomain.appspot.com/search"
params = {
    "ai_prompt": "Tell me something funny",
}
headers = {
    "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6Ietc..."
}

response = requests.post(url, params=params, headers=headers, stream=True)
for chunk in response.iter_lines():
    if chunk:
        print(chunk.decode("utf-8"))
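I also considered reading raw bytes with iter_content instead of iter_lines, in case line buffering on the client side was hiding the chunks (a sketch; the chunk_size choice is just a guess):

# Hypothetical variant: read raw chunks as they arrive instead of waiting for complete lines.
with requests.post(url, params=params, headers=headers, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)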
It should stream the text response as it arrives from Vertex AI; instead I am getting it all in one shot.
What am I missing here? I appreciate your help.
Note: This isn't a duplicate. This issue is specifically with respect to the Google App Engine platform.
According to the Google App Engine documentation:
App Engine does not support streaming responses where data is sent in incremental chunks to the client while a request is being processed. All data from your code is collected as described above and sent as a single HTTP response.