openai-api

How to calculate tokens while reading a webpage


I am planning to run this code, but first I would like to know how many tokens the bot will consume (to save cost).

import os
from embedchain import App

# Create a bot instance
os.environ["OPENAI_API_KEY"] = "YOUR API KEY"
elon_bot = App()

# Embed online resources
elon_bot.add("web_page", "https://en.wikipedia.org/wiki/Elon_Musk")
elon_bot.add("web_page", "https://tesla.com/elon-musk")
elon_bot.add("youtube_video", "https://www.youtube.com/watch?v=MxZpaJK74Y4")

# Query the bot
elon_bot.query("How many companies does Elon Musk run?")
# Answer: Elon Musk runs four companies: Tesla, SpaceX, Neuralink, and The Boring Company
 

From:

https://github.com/embedchain/embedchain


Solution

  • One token generally corresponds to about 4 characters of common English text, or roughly ¾ of a word (so 100 tokens ~= 75 words). Ref: https://platform.openai.com/tokenizer

    You can use tiktoken to count the tokens for a particular model.

    https://github.com/openai/tiktoken

    import tiktoken

    text = "blah blah .. more text here."

    # Look up the encoding for the model you plan to use (swap in your own model name)
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    token_count = len(encoding.encode(text))
    print(f"Token count: {token_count}")
    

    For embedchain, you need to figure out how to extract the text from web pages that you added, and pass it to tiktoken to count.
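As a rough sketch of that last step, the code below downloads a page, strips the markup with the standard library, and counts tokens on what remains. Everything here is an assumption for illustration: the helper names (html_to_text, estimate_tokens, count_tokens, count_page_tokens) are hypothetical, the model name is a placeholder, and embedchain's own loaders may clean and chunk pages differently, so treat the result as an estimate and not as the exact number the bot will be billed for. If tiktoken is not installed, the sketch falls back to the ~4 characters per token rule mentioned above.

```python
import math
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        words = data.split()  # also normalizes runs of whitespace
        if self._skip_depth == 0 and words:
            self.parts.append(" ".join(words))


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def estimate_tokens(text):
    """Heuristic only: about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def count_tokens(text, model="gpt-3.5-turbo"):
    """Count with tiktoken when it is installed, else use the heuristic."""
    try:
        import tiktoken
    except ImportError:
        return estimate_tokens(text)
    return len(tiktoken.encoding_for_model(model).encode(text))


def count_page_tokens(url, model="gpt-3.5-turbo"):
    """Fetch a page and count the tokens in its visible text."""
    request = Request(url, headers={"User-Agent": "token-count-sketch"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_tokens(html_to_text(html), model)
```

For example, count_page_tokens("https://en.wikipedia.org/wiki/Elon_Musk") would give an approximate token count for the first page in the question. This only covers the web pages; the YouTube transcript and the tokens used by each query and answer would need to be counted separately.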