
OpenAI Assistants API: Why does a single question I ask my assistant spend so many tokens?


I have a NodeJS program that connects to OpenAI's Assistants API to create messages. I followed this documentation from OpenAI and built the steps below:

  1. I have created an Assistant (gpt-4-1106-preview) and a thread on that Assistant that I use to interact with it.
  2. Add a message to the thread. The message contains around 1,000 tokens, checked via https://platform.openai.com/tokenizer
    await openai.beta.threads.messages.create(threadId, {
        role: "user",
        content: createMessage(),
    });
  3. Run the assistant
    await openai.beta.threads.runs.create(threadId, {
        assistant_id: assistantId,
        instructions:
            "Please address the user as Mahesh. The user is an administrator.",
    });
  4. Check the status. I'm running this every 5 seconds until the status is "completed"
    await openai.beta.threads.runs.retrieve(threadId, runId);
  5. Get the last response from the Assistant
    const messages = await openai.beta.threads.messages.list(threadId, {
        limit: 1,
    });
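
The reply text can then be read from the newest message in that list (assuming its first content part is text, per the Node SDK's response shape):

    const latestReply = messages.data[0].content[0].text.value; // newest message comes first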

This code takes around 250,000 tokens to complete. The image shows today's token usage for three requests.

[Screenshot: today's token usage for three requests]


Solution

  • There could be multiple reasons why the cost of running your assistant is so high.

    What OpenAI model do you use?

    If you take a look at the official OpenAI documentation, you'll see that they use the gpt-4-1106-preview model. They state:

    We recommend using OpenAI’s latest models with the Assistants API for best results and maximum compatibility with tools.

    But an older model might be good enough; it depends on what your assistant is used for. You can lower the cost of running the assistant just by changing the model. Of course, if the assistant's performance becomes considerably worse, then you need to use the latest models. The table below shows what a difference the choice of model can make:

     MODEL                INPUT                 OUTPUT
     gpt-4-1106-preview   $0.01 / 1K tokens     $0.03 / 1K tokens
     gpt-3.5-turbo-1106   $0.001 / 1K tokens    $0.002 / 1K tokens
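
    If a cheaper model turns out to be good enough, switching is a one-line change when creating (or updating) the assistant. A minimal sketch using the Node SDK; the name and instructions below are placeholders:

        // Create the assistant on the cheaper model instead of gpt-4-1106-preview
        const assistant = await openai.beta.assistants.create({
            name: "My assistant", // placeholder
            instructions: "You are a helpful assistant.", // placeholder
            model: "gpt-3.5-turbo-1106",
        });

        // Or switch an existing assistant in place
        await openai.beta.assistants.update(assistantId, {
            model: "gpt-3.5-turbo-1106",
        });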

    How long have you been using the same thread?

    As stated in the official OpenAI documentation:

    Assistants can access persistent threads. Threads simplify AI application development by storing message history and truncating it when the conversation gets too long for the model’s context length. You create a thread once, and simply append messages to it as your users reply.

    / ... /

    Threads and messages represent a conversation session between an assistant and a user. There is no limit to the number of messages you can store in a thread. Once the size of the messages exceeds the context window of the model, the thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages.

    The thread is storing the message history! The gpt-4-1106-preview model has a context window of 128,000 tokens. So, if you chat with your assistant in the same thread long enough, you will fill the thread up to the context window of your chosen model.

    With gpt-4-1106-preview, this means that after chatting with your assistant in the same thread for a while, a single question can cost up to 128,000 tokens. Your latest question might contain only 1,000 tokens, but keep in mind that the hundreds of messages you asked or the assistant answered in the past are also sent to the Assistants API with every run.

    In your case, you can see that today you spent 760,564 context tokens. You have probably been using the same thread for quite some time.
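
    One way to keep this under control is not to reuse a single thread indefinitely. A rough sketch, assuming you are willing to drop older context and start a fresh conversation from time to time:

        // Start a new, empty thread so runs no longer send hundreds of old messages to the model
        const thread = await openai.beta.threads.create();
        const threadId = thread.id; // use this fresh id for subsequent messages and runs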

    How often do you check the run status?

    You said that you check the run status every 5 seconds to see whether it has moved to completed. Try increasing this interval to, say, 10 seconds to make fewer API calls. You pay for every API call you make.
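
    A minimal polling sketch with a configurable interval (pollIntervalMs is just an arbitrary name used here):

        // Poll the run status at a configurable interval instead of a hard-coded 5 seconds
        const pollIntervalMs = 10_000;
        let run = await openai.beta.threads.runs.retrieve(threadId, runId);
        while (run.status !== "completed" && run.status !== "failed") {
            await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
            run = await openai.beta.threads.runs.retrieve(threadId, runId);
        }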