Tags: python, openai-api, chatgpt-api, azure-openai

OpenAI API: ChatCompletion and Completion give totally different answers with the same parameters. Why?


I'm exploring the use of different prompts with gpt-3.5-turbo.

Investigating the differences between "ChatCompletion" and "Completion", some references say they should be more or less the same, for example: https://platform.openai.com/docs/guides/gpt/chat-completions-vs-completions

Other sources say, as expected, that ChatCompletion is more useful for chatbots, since you have "roles" (system, user and assistant) that let you orchestrate things like few-shot examples and/or memory of previous chat messages, while Completion is more useful for summarization or plain text generation.
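For context, this is roughly how I'm using the roles (a minimal sketch with the same legacy openai 0.x Azure setup as below; the engine name and the example messages are just placeholders):

import openai

openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"
openai.api_base = ...  # Azure endpoint placeholder
openai.api_key = ...   # API key placeholder

# Roles let you combine a system instruction, a few-shot example
# and "memory" of earlier turns in a single request.
messages = [
    {"role": "system", "content": "You answer with a single short fact."},
    # few-shot demonstration turn
    {"role": "user", "content": "Give me something interesting about space:"},
    {"role": "assistant", "content": "A day on Venus is longer than its year."},
    # the new user question
    {"role": "user", "content": "Give me something interesting about history:"},
]

response = openai.ChatCompletion.create(
    engine="my_model",  # gpt-35-turbo deployment
    messages=messages,
    temperature=0,
    max_tokens=200)

print(response.choices[0]["message"]["content"])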

But the difference seems to be much bigger than that, and I can't find references explaining what is happening under the hood.

The following experiment gives me totally different results, even when using the same model with the same parameters.

With ChatCompletion

import os
import openai
openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"
openai.api_base = ...
openai.api_key = ...

chat_response = openai.ChatCompletion.create(
  engine="my_model", # gpt-35-turbo
  messages = [{"role":"user","content":"Give me something intresting:\n"}],
  temperature=0,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None)

print(chat_response.choices[0]['message']['content'])

Result is a fact about a war:

Did you know that the shortest war in history was between Britain and Zanzibar in 1896? It lasted only 38 minutes!

With Completion

regular_response = openai.Completion.create(
  engine="my_model", # gpt-35-turbo
  prompt="Give me something intresting:\n",
  temperature=0,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None)

print(regular_response['choices'][0]['text'])

Result is Python code and an explanation of what it does:

    ```
    import random
    import string
    
    def random_string(length):
        return ''.join(random.choice(string.ascii_letters) for i in range(length))
    
    print(random_string(10))
    ```
    Output:
    ```
    'JvJvJvJvJv'
    ```
    This code generates a random string of length `length` using `string.ascii_letters` and `random.choice()`. `string.ascii_letters` is a string containing all ASCII letters (uppercase and lowercase). `random.choice()` returns a random element from a sequence. The `for` loop generates `length` number of random letters and `join()` concatenates them into a single string. The result is a random string of length `length`. This can be useful for generating random passwords or other unique identifiers.<|im_end|>

Notes

  1. I'm using the same parameters (temperature, top_p, etc.). The only difference is the ChatCompletion/Completion API.
  2. The model is the same in both cases, gpt-35-turbo.
  3. I'm keeping the temperature low so I can get more consistent results.
  4. Other prompts also give totally different answers, for example "What is the definition of song?"

The Question

  • Why is this happening?
  • Shouldn't the same prompt give similar results, given that both are using the same model?
  • Is there any reference material where OpenAI explains what it is doing under the hood?

Solution

  • I actually found the answer by chance while reviewing some old notebooks.

    It's all in the hidden tags, or as I now found out, the Chat Markup Language (ChatML): https://github.com/openai/openai-python/blob/main/chatml.md

    This prompt with the Completion API now returns almost the same answer as the ChatCompletion one:

    prompt = """<|im_start|>system
    <|im_end|>
    <|im_start|>user
    Give me something intresting:
    <|im_end|>
    <|im_start|>assistant
    """
    
    regular_response = openai.Completion.create(
      engine="my_model", # gpt-35-turbo
      prompt=prompt,
      temperature=0,
      max_tokens=800,
      top_p=0.95,
      frequency_penalty=0,
      presence_penalty=0,
      stop=None)
    
    print(regular_response['choices'][0]['text'])
    

    Result now is the same fact about a war (with the ending tag):

    Did you know that the shortest war in history was between Britain and Zanzibar in 1896? The war lasted only 38 minutes, with the British emerging victorious.<|im_end|>
    

    It seems that all the ChatCompletion API is doing is wrapping your messages in those ChatML tags before sending them to the model.
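
    To double-check this, here is a minimal sketch (same legacy Completion call as above; build_chatml_prompt is just a hypothetical helper) that builds the ChatML prompt from a ChatCompletion-style messages list. Passing <|im_end|> as a stop sequence should also keep the ending tag out of the returned text:

    def build_chatml_prompt(messages):
        # Wrap each message in ChatML tags, then open an assistant turn
        # for the model to complete.
        parts = []
        for m in messages:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>")
        parts.append("<|im_start|>assistant\n")
        return "\n".join(parts)

    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Give me something intresting:"},  # same prompt text as above
    ]

    regular_response = openai.Completion.create(
      engine="my_model", # gpt-35-turbo
      prompt=build_chatml_prompt(messages),
      temperature=0,
      max_tokens=800,
      top_p=0.95,
      frequency_penalty=0,
      presence_penalty=0,
      stop=["<|im_end|>"])  # stop at the ChatML end-of-message tag

    print(regular_response['choices'][0]['text'])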