Tags: python, openai-api

OpenAI API returning parts of words as most probable token


It is my first time working with the OpenAI API (and my first time anywhere near language models, for that matter). Given an incomplete sentence that is missing its last word, I would like to obtain the top words that could complete it, together with their probabilities. For example, for the sentence "My least favourite food is...", I would like to know which words could complete that sentence and with what probabilities.

The problem is that, for some sentences, the model does not return words but what I assume are parts of words. For example, for the sentence above, the top three results are 'bro', 'an', and 'spi' (which I suspect stand for broccoli, anchovies, and spinach). For some other sentences (for example, "Yesterday I went to the...") it seems to give good responses ('park' and 'supermarket').

I am using the get_completion() function defined here.

from openai import OpenAI
from math import exp
import numpy as np
from IPython.display import display, HTML
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>")) # Adding my API key

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # whether to return log probabilities of the output tokens; if True, each output token's logprob is returned in the message content
    top_logprobs=None,
) -> str:
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

After defining the function, I use this code:

# Define sentence list
sentence_list = [
    "Yesterday I went to the",
    "My least favorite food is"
]

results = [] # Empty list to store results

for sentence in sentence_list:
    PROMPT = f"Predict the next word to end this sentence: {sentence}"
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT}],  # PROMPT is an f-string, so {sentence} is already filled in
        model="gpt-4",
        logprobs=True,
        top_logprobs=3,
        )
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        # Store original sentence, token, logprob and linear probability (%)
        tokenResults = [sentence, token.token, token.logprob, float(np.round(np.exp(token.logprob) * 100, 2))]
        results.append(tokenResults)

Is there a way I can force it to give me full words and their probabilities? I know virtually nothing about language models, so I don't know whether these results are expected or whether I am doing something wrong. I have tried changing the prompt, but the results are always the same.

I appreciate any help. Also, I hope the question format is OK, it is my first time posting.


Solution

  • To answer your question directly: These results are expected, and you did everything correctly. Unfortunately, there may not be an easy way to achieve exactly what you're asking for.

    Large Language Models (LLMs) do not always generate complete words in their responses. Instead, they generate responses in tokens, which may or may not correspond to full words. In other words, tokens are the basic units of language generation. You can learn more about this in the Hugging Face NLP Course.

    For the tokenizer used by OpenAI's GPT-3.5 and GPT-4 models, here are some relevant tokens in the model's vocabulary:

    15222  bro  
    60803  ccoli  
    42682  spin  
    613    ach  
    9712   super  
    19859  market  
    29836  park  
    

    In order to generate the word "broccoli", for example, the model needs to generate at least two tokens: [15222, 60803]. This is because the full word "broccoli" doesn't exist as a single token in the vocabulary. (You can explore this on OpenAI’s Tokenizer page).
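    Because log probabilities add while probabilities multiply, the probability of a multi-token word is the product of its per-token probabilities. Here is a minimal illustration with made-up numbers (the two logprob values below are hypothetical, not real API output):

    ```python
    import math

    # Hypothetical probabilities: P("bro") = 0.25 at the first position,
    # and P("ccoli" given "bro") = 0.9 at the second. Values are made up.
    lp_bro = math.log(0.25)
    lp_ccoli = math.log(0.9)

    # Summing logprobs corresponds to multiplying probabilities, so the
    # full word "broccoli" has probability 0.25 * 0.9 = 0.225.
    p_word = math.exp(lp_bro + lp_ccoli)
    print(round(p_word, 3))  # 0.225
    ```

    This is why the API's per-token logprobs cannot be read off directly as per-word probabilities: the first token's probability is only an upper bound on the full word's probability.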

    The API does expose logprobs, but only for each token position it actually generates. It does not provide the probability of a full word spanning several future token positions, which is what you're trying to obtain.
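    One partial workaround is to let the model generate enough tokens to finish the word, then stitch the sub-word tokens of the chosen completion back together yourself, summing their logprobs. A minimal post-processing sketch on a hand-written stand-in for `choices[0].logprobs.content` (the token strings and logprob values are hypothetical):

    ```python
    import math

    # Stand-in for API_RESPONSE.choices[0].logprobs.content: one
    # (token, logprob) pair per generated position. Values are made up.
    content = [("bro", -1.39), ("ccoli", -0.11)]

    # Join the sub-word tokens into a word; sum logprobs to combine them.
    word = "".join(token for token, _ in content)
    word_logprob = sum(logprob for _, logprob in content)
    word_prob_pct = round(math.exp(word_logprob) * 100, 2)

    print(word, word_prob_pct)  # broccoli 22.31
    ```

    Note this only recovers the probability of the single completion the model actually chose; it does not give you the full ranked list of candidate *words*, because the alternatives in top_logprobs are still sub-word tokens.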

    Note: OpenAI's Completions API is now Legacy.

    Most developers should use the Chat Completions API to leverage the latest and best models.