OpenAI's text models have a context length, e.g., Curie has a context length of 2049 tokens. They provide the max_tokens and stop parameters to control the length of the generated sequence, so generation stops either when a stop token is produced or when max_tokens is reached.

The issue is that when generating text, I don't know how many tokens my prompt contains. Since I do not know that, I cannot set max_tokens = 2049 - number_tokens_in_prompt. This prevents me from generating text dynamically for prompts that vary widely in length. What I need is to keep generating until the stop token is reached.
My questions are:

1. How can I count the number of tokens in my prompt so that I can set the max_tokens parameter accordingly?
2. Is there a way to set max_tokens to the maximum cap so that I won't need to count the number of prompt tokens?

As stated in the official OpenAI article:
To further explore tokenization, you can use our interactive Tokenizer tool, which allows you to calculate the number of tokens and see how text is broken into tokens. Alternatively, if you'd like to tokenize text programmatically, use tiktoken as a fast BPE tokenizer specifically used for OpenAI models.
A tokenizer can split the text string into a list of tokens, as stated in the official OpenAI example on counting tokens with tiktoken:
tiktoken is a fast open-source tokenizer by OpenAI. Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:
- whether the string is too long for a text model to process and
- how much an OpenAI API call costs (as usage is priced by token).
As of April 2024, tiktoken supports 2 encodings used by OpenAI models (source 1, source 2):
| Encoding name | OpenAI models |
|---|---|
| `o200k_base` | • GPT-4o models (`gpt-4o`) |
| `cl100k_base` | • GPT-4 models (`gpt-4`)<br>• GPT-3.5 Turbo models (`gpt-3.5-turbo`)<br>• GPT Base models (`davinci-002`, `babbage-002`)<br>• Embeddings models (`text-embedding-ada-002`, `text-embedding-3-large`, `text-embedding-3-small`)<br>• Fine-tuned models (`ft:gpt-4`, `ft:gpt-3.5-turbo`, `ft:davinci-002`, `ft:babbage-002`) |
Note: The `p50k_base` and `r50k_base` encodings were used for models that are deprecated as of April 2024.
Official OpenAI libraries:

- Python: tiktoken

3rd-party ports of tiktoken exist for other languages as well.

Install the official Python library:

```shell
pip install --upgrade tiktoken
```
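If you want to check the model-to-encoding mapping programmatically rather than reading it from the table above, a minimal sketch (this assumes your installed tiktoken version is recent enough to know about the newer models):

```python
import tiktoken

# Look up which encoding a given model uses.
print(tiktoken.encoding_for_model("gpt-4o").name)         # o200k_base
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # cl100k_base

# List all encodings registered in the installed tiktoken version.
print(tiktoken.list_encoding_names())
```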
OPTION 1: Search in the table above for the correct encoding for a given OpenAI model
get_tokens_1.py

```python
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


print(num_tokens_from_string("Hello world, let's test tiktoken.", "cl100k_base"))
```

If you run get_tokens_1.py, you'll get the following output:

```
9
```
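If you also want to see the individual tokens rather than just the count, a small follow-up using the same encoding (the string is just the example from above):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("Hello world, let's test tiktoken.")

print(token_ids)                                                     # the token IDs the model sees
print([encoding.decode_single_token_bytes(t) for t in token_ids])    # the byte piece behind each token
print(encoding.decode(token_ids))                                    # round-trip back to the original string
```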
OPTION 2: Use tiktoken.encoding_for_model() to automatically load the correct encoding for a given OpenAI model

get_tokens_2.py

```python
import tiktoken


def num_tokens_from_string(string: str, model_name: str) -> int:
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


print(num_tokens_from_string("Hello world, let's test tiktoken.", "gpt-3.5-turbo"))
```

If you run get_tokens_2.py, you'll get the following output:

```
9
```
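To tie this back to your first question: once you can count the prompt tokens, you can set max_tokens dynamically. A minimal sketch, assuming a completions-style model where the prompt and the completion share one context window (the CONTEXT_LENGTH value and the remaining_tokens helper are illustrative, not part of any official API):

```python
import tiktoken

CONTEXT_LENGTH = 2049  # e.g., Curie; use your model's actual context window


def remaining_tokens(prompt: str, model_name: str = "gpt-3.5-turbo") -> int:
    """Upper bound for max_tokens: context window minus prompt tokens."""
    encoding = tiktoken.encoding_for_model(model_name)
    prompt_tokens = len(encoding.encode(prompt))
    return max(CONTEXT_LENGTH - prompt_tokens, 0)


print(remaining_tokens("Hello world, let's test tiktoken."))
```

You would then pass the returned value as max_tokens and still rely on the stop parameter to end generation earlier.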
Note: If you take a careful look at the usage field in the OpenAI API response, you'll see that it reports 10 tokens used for an identical message, i.e., 1 token more than tiktoken. I tested this in the past and still haven't figured out why. As @Jota mentioned in the comment below, there still seems to be a mismatch between the token usage reported by the OpenAI API response and tiktoken.
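For chat models specifically, part of such differences comes from the special formatting tokens that wrap every message, which the API counts but a raw encode() of the text does not. The official OpenAI cookbook example accounts for this; a shortened sketch of that approach (the per-message constants follow the cookbook's values for gpt-3.5-turbo/gpt-4-style models and are not guaranteed for other or future models):

```python
import tiktoken


def num_tokens_from_messages(messages, model_name: str = "gpt-3.5-turbo") -> int:
    """Approximate prompt token count for the Chat Completions API."""
    encoding = tiktoken.encoding_for_model(model_name)
    tokens_per_message = 3  # every message is wrapped with start/role/end formatting tokens
    tokens_per_name = 1     # an explicit "name" field adds one token
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with the assistant header tokens
    return num_tokens


print(num_tokens_from_messages([{"role": "user", "content": "Hello world, let's test tiktoken."}]))
```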