python artificial-intelligence token llama

Issue with Llama 2-7B Model Producing Output Limited to 511 Tokens


I am facing an issue with the Llama 2-7B model where the output is consistently limited to only 511 tokens, even though the model should theoretically be capable of producing outputs up to a maximum of 4096 tokens.

I’ve tried setting the max_tokens parameter to higher values, such as 3000, and have calculated the available tokens by subtracting the prompt tokens from the model’s total token limit (4096 tokens). However, despite these adjustments, I continue to receive outputs capped at 511 tokens.

Here’s a snippet of the code I am using to interact with the model:

import psutil
import os
import warnings
from llama_cpp import Llama

# Suppress warnings
warnings.filterwarnings("ignore")

# Path to the model
model_path = "C:/Llama_project/models/llama-2-7b-chat.Q2_K.gguf"

# Load the model
llm = Llama(model_path=model_path)

# System message to set the behavior of the assistant
system_message = "You are a helpful assistant."

# Function to ask questions
def ask_question(question):
    # Use user input for the question prompt
    prompt = f"Answer the following question: {question}"

    # Calculate the remaining tokens for output based on the model's 4096 token limit
    prompt_tokens = len(prompt.split())  # Rough token count estimate
    max_output_tokens = 4096 - prompt_tokens  # Tokens left for output
    
    # Monitor memory usage before calling the model
    process = psutil.Process(os.getpid())
    mem_before = process.memory_info().rss / 1024 ** 2  # Memory in MB

    # Get the output from the model with the calculated max tokens for output
    output = llm(prompt=prompt, max_tokens=max_output_tokens, temperature=0.7, top_p=1.0)

    # Monitor memory usage after calling the model
    mem_after = process.memory_info().rss / 1024 ** 2  # Memory in MB
    
    # Clean the output and return only the answer text
    return output["choices"][0]["text"].strip()

# Main loop for user interaction
while True:
    user_input = input("Ask a question (or type 'exit' to quit): ")
    
    if user_input.lower() == 'exit':
        print("Exiting the program.")
        break
    
    # Get the model's response
    answer = ask_question(user_input)
    
    # Print only the answer
    print(f"Answer: {answer}")

Problem Details:

  • Model: Llama 2-7B (Q2_K version)
  • Expected Output: I was expecting a response close to the maximum token limit (3000 or more tokens).
  • Actual Output: The output is capped at 511 tokens, regardless of the prompt length.

Tried:

  • Setting max_tokens to 3000 or higher.
  • Calculating the available tokens by subtracting the prompt length from the model’s total token limit.

I would expect the model to generate responses that are close to the token limit (ideally closer to 3000 tokens or more, depending on the input), but it keeps producing output limited to 511 tokens.


Solution

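  • Try setting n_ctx in the Llama constructor, for example to 2048:

    llm = Llama(model_path=model_path, n_ctx=2048)

    This parameter sets the size of the context window, that is, the maximum combined length of the prompt and the response in tokens. llama-cpp-python uses a default of 512 when n_ctx is not given, which is consistent with the output stopping at about 511 tokens. Since Llama 2 supports a context of up to 4096 tokens, n_ctx=4096 should also work if memory allows.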
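
To make the relationship between the context window and max_tokens concrete, here is a minimal sketch. It assumes, as described above, that the prompt and the response share one window of n_ctx tokens. The helper names max_output_tokens and load_and_ask are hypothetical and not part of llama-cpp-python; the sketch only relies on Llama.tokenize and Llama.n_ctx from that library. It also counts real tokens instead of using len(prompt.split()), which counts words and usually underestimates the token count.

```python
def max_output_tokens(n_ctx: int, prompt_tokens: int, requested: int) -> int:
    """Return how many tokens can still be generated once the prompt
    has taken its share of the context window (hypothetical helper)."""
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt does not fit in the context window")
    return min(requested, n_ctx - prompt_tokens)


def load_and_ask(model_path: str, question: str, n_ctx: int = 4096) -> str:
    """Hypothetical wrapper: load the model with an explicit context size
    and request only as many tokens as the window can hold."""
    # Imported lazily so the helper above can be tried without the package.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=n_ctx)
    prompt = f"Answer the following question: {question}"

    # Count actual tokens; tokenize() expects bytes.
    prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
    budget = max_output_tokens(llm.n_ctx(), prompt_tokens, requested=3000)

    output = llm(prompt=prompt, max_tokens=budget, temperature=0.7, top_p=1.0)
    return output["choices"][0]["text"].strip()


# Illustrative numbers only: a 512-token window with a 1-token prompt
# leaves room for 511 tokens, whatever max_tokens asks for.
print(max_output_tokens(512, 1, 3000))    # 511
print(max_output_tokens(4096, 20, 3000))  # 3000
```

Calling load_and_ask("C:/Llama_project/models/llama-2-7b-chat.Q2_K.gguf", "...") would then use the full 4096-token window. A larger n_ctx increases memory use, so a smaller value such as 2048 may be the safer choice on constrained machines.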