
OpenAI API: How do I count tokens before(!) I send an API request?


OpenAI's text models have a context length, e.g.: Curie has a context length of 2049 tokens.

They provide the max_tokens and stop parameters to control the length of the generated sequence: generation stops either when a stop sequence is encountered or when max_tokens is reached.

The issue is: when generating text, I don't know in advance how many tokens my prompt contains. Since I don't know that, I cannot set max_tokens = 2049 - number_tokens_in_prompt.

This prevents me from generating text dynamically for prompts of widely varying lengths. What I need is to keep generating until the stop token is reached.

My questions are:

  • How can I count the number of tokens in the Python API so that I can set the max_tokens parameter accordingly?
  • Is there a way to set max_tokens to the maximum cap so that I won't need to count the number of prompt tokens?

Solution

  • How do I count tokens before(!) I send an API request?

    As stated in the official OpenAI article:

    To further explore tokenization, you can use our interactive Tokenizer tool, which allows you to calculate the number of tokens and see how text is broken into tokens. Alternatively, if you'd like to tokenize text programmatically, use tiktoken as a fast BPE tokenizer specifically used for OpenAI models.

    How does a tokenizer work?

    A tokenizer can split the text string into a list of tokens, as stated in the official OpenAI example on counting tokens with tiktoken:

    tiktoken is a fast open-source tokenizer by OpenAI.

    Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

    Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:

    • whether the string is too long for a text model to process and
    • how much an OpenAI API call costs (as usage is priced by token).

    Which encodings does OpenAI use for its models?

    As of April 2024, tiktoken supports 2 encodings used by OpenAI models (source 1, source 2):

    Encoding name | OpenAI models
    o200k_base    | • GPT-4o models (gpt-4o)
    cl100k_base   | • GPT-4 models (gpt-4)
                  | • GPT-3.5 Turbo models (gpt-3.5-turbo)
                  | • GPT Base models (davinci-002, babbage-002)
                  | • Embeddings models (text-embedding-ada-002, text-embedding-3-large, text-embedding-3-small)
                  | • Fine-tuned models (ft:gpt-4, ft:gpt-3.5-turbo, ft:davinci-002, ft:babbage-002)

    Note: The p50k_base and r50k_base encodings were used for models that are deprecated as of April 2024.

    What tokenizer libraries are out there?

    Official OpenAI libraries:

    • Python: tiktoken

    3rd-party libraries (examples):

    • Java: JTokkit
    • Go: tiktoken-go
    • Rust: tiktoken-rs

    How do I use tiktoken?

    1. Install or upgrade tiktoken: pip install --upgrade tiktoken
    2. Write the code to count tokens; you have two options.

    OPTION 1: Search in the table above for the correct encoding for a given OpenAI model

    If you run get_tokens_1.py, you'll get the following output:

    9

    get_tokens_1.py

    import tiktoken
    
    def num_tokens_from_string(string: str, encoding_name: str) -> int:
        encoding = tiktoken.get_encoding(encoding_name)
        num_tokens = len(encoding.encode(string))
        return num_tokens
    
    print(num_tokens_from_string("Hello world, let's test tiktoken.", "cl100k_base"))
    

    OPTION 2: Use tiktoken.encoding_for_model() to automatically load the correct encoding for a given OpenAI model

    If you run get_tokens_2.py, you'll get the following output:

    9

    get_tokens_2.py

    import tiktoken
    
    def num_tokens_from_string(string: str, model_name: str) -> int:
        # encoding_for_model() expects a model name (e.g. "gpt-3.5-turbo"),
        # not an encoding name, and loads the matching encoding for it
        encoding = tiktoken.encoding_for_model(model_name)
        num_tokens = len(encoding.encode(string))
        return num_tokens
    
    print(num_tokens_from_string("Hello world, let's test tiktoken.", "gpt-3.5-turbo"))
    

    Note: If you take a careful look at the usage field in the OpenAI API response, you'll see that it reports 10 tokens used for an identical message. That's 1 token more than tiktoken reports. I still haven't figured out why; I tested this in the past, and as @Jota mentioned in the comment below, there still seems to be a mismatch between the token usage reported by the OpenAI API response and tiktoken.