Tags: azure, azure-openai, azure-ai, retrieval-augmented-generation, chat-gpt-4

BadRequestError: Context length exceeded the 8192 token limit, resulting in error code 400


I am building a chat flow in Azure AI Studio. The goal is to have 3 index lookups and have the LLM compare the differences.

However, if I set top_k to 3, I get the following error, because the LLM prompt has to take input from all 3 sources.

chat : OpenAI API hits BadRequestError: Error code: 400 - {'error':
 {'message': "This model's maximum context length is 8192 tokens.
 However, your messages resulted in 8374 tokens. Please reduce the
 length of the messages.", 'type': 'invalid_request_error', 'param':
 'messages', 'code': 'context_length_exceeded'}} [Error reference:
 https://platform.openai.com/docs/guides/error-codes/api-errors]

This is my current prompt:

# system:
You are a helpful AI assistant

This is what you know:
1. Dataset 1
{% for item in dataOne %}
text: {{item.text}}
{% endfor %}

2. Dataset 2
{% for item in dataTwo %}
text: {{item.text}}
{% endfor %}

Considering the past chat history
{% for item in chat_history %}
# user:
{{item.inputs.question}}
# assistant:
{{item.outputs.answer}}
{% endfor %}

# user:
{{question}}

Keep the answer short, brief and concise.
  1. Is there a way to work around this token limit?

  2. Does it make sense to stack 2 LLM blocks such that

    • the first block's prompt takes the data/files into consideration, and

    • the second block's prompt looks at the chat history?

Edit:

  1. Use a model with higher token input limits

https://learn.microsoft.com/en-sg/answers/questions/1628268/azure-ai-studio-evaluation-error-in-microsoft-lear

Appreciate any help.


Solution

  • Thanks for reaching out to us and reporting this.

    Please note that the model has a maximum context length of 8192 tokens; if your prompt and conversation history exceed this limit, you will receive a BadRequestError.

    Here are some suggestions you can try. First, check whether you can reduce the size of your datasets dataOne and dataTwo, either by including only the most relevant information or by summarizing the data.
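
    For example, here is a minimal Python sketch that trims each retrieved dataset to a fixed token budget before it reaches the prompt (the tiktoken encoding, the 2000-token budget and the trim_items helper are assumptions for illustration, not part of your flow):

    import tiktoken

    def trim_items(items, max_tokens, encoding_name="cl100k_base"):
        # Keep only as many retrieved items as fit within max_tokens.
        enc = tiktoken.get_encoding(encoding_name)
        kept, used = [], 0
        for item in items:
            n = len(enc.encode(item["text"]))
            if used + n > max_tokens:
                break
            kept.append(item)
            used += n
        return kept

    # e.g. give each dataset roughly a quarter of the 8192-token budget
    dataOne = trim_items(dataOne, max_tokens=2000)
    dataTwo = trim_items(dataTwo, max_tokens=2000)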

    If the chat history is too long, you could truncate it to fit within the token limit. Be aware, however, that this might cause the model to lose some context.
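
    One possible way to do that, keeping the most recent turns that fit within a budget you choose (a sketch only; it assumes the inputs.question / outputs.answer shape used in the prompt above):

    import tiktoken

    def truncate_history(chat_history, max_tokens, encoding_name="cl100k_base"):
        # Walk backwards so the newest turns are kept first, then restore order.
        enc = tiktoken.get_encoding(encoding_name)
        kept, used = [], 0
        for turn in reversed(chat_history):
            text = turn["inputs"]["question"] + turn["outputs"]["answer"]
            n = len(enc.encode(text))
            if used + n > max_tokens:
                break
            kept.append(turn)
            used += n
        return list(reversed(kept))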

    Your idea of stacking two LLM blocks is a good one. The first block could process the datasets and generate a summary or extract the key information; the second block could then take that summary along with the chat history to generate the response. This can help keep you within the token limit.
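
    As a rough illustration of that two-block idea outside the flow designer (the endpoint, key and deployment name below are placeholders; in Azure AI Studio you would wire two LLM nodes together instead):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-key>",                                        # placeholder
        api_version="2024-02-01",
    )

    def answer(data_one_text, data_two_text, history_text, question):
        # Block 1: compress both datasets into a short comparison summary.
        summary = client.chat.completions.create(
            model="gpt-4",  # your deployment name
            messages=[
                {"role": "system",
                 "content": "Summarize and compare the two datasets below in under 300 words."},
                {"role": "user",
                 "content": f"Dataset 1:\n{data_one_text}\n\nDataset 2:\n{data_two_text}"},
            ],
        ).choices[0].message.content

        # Block 2: answer the question from the summary plus the chat history.
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": f"You are a helpful AI assistant. This is what you know:\n{summary}"},
                {"role": "user",
                 "content": f"Chat history:\n{history_text}\n\nQuestion: {question}"},
            ],
        )
        return reply.choices[0].message.content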

    Using a model with a higher token limit could also be a solution.
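
    For instance, you could point the same call (or the LLM node's connection) at a deployment of a larger-context model such as gpt-4-32k or gpt-4o; the deployment name below is hypothetical:

    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical deployment of a larger-context model
        messages=messages,
    )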

    Hope this helps.