I've been comparing several LangChain-compatible Llama 2 runtimes using LangChain's LLMChain, with the following parameter overrides:
# llama.cpp:
model_path="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
n_ctx = 2048,
max_tokens = 2048,
temperature = 0.85,
top_k = 40,
top_p = 0.95,
repeat_penalty = 1.1,
seed = 112358,
# ctransformers:
model="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
config={
"context_length": 2048,
"max_new_tokens": 2048,
"temperature": 0.85,
"top_k": 40,
"top_p": 0.95,
"repetition_penalty" :1.1,
"seed" : 112358
},
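One way to double-check that the two sets of overrides above really match is to translate the ctransformers names into the llama.cpp names and diff the values. This is only a minimal sketch in plain Python: `NAME_MAP` and `diff_overrides` are made up for illustration and are not part of LangChain, llama.cpp, or ctransformers.

```python
# Hypothetical helper, not part of any of the libraries discussed here.
# Maps ctransformers config keys to the llama.cpp parameter names listed above.
NAME_MAP = {
    "context_length": "n_ctx",
    "max_new_tokens": "max_tokens",
    "temperature": "temperature",
    "top_k": "top_k",
    "top_p": "top_p",
    "repetition_penalty": "repeat_penalty",
    "seed": "seed",
}

llama_cpp_overrides = {
    "n_ctx": 2048,
    "max_tokens": 2048,
    "temperature": 0.85,
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.1,
    "seed": 112358,
}

ctransformers_config = {
    "context_length": 2048,
    "max_new_tokens": 2048,
    "temperature": 0.85,
    "top_k": 40,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
    "seed": 112358,
}


def diff_overrides(llama_cpp, ctransformers, name_map=NAME_MAP):
    """Return {llama_cpp_name: (llama_cpp_value, ctransformers_value)} for mismatches."""
    # Rename ctransformers keys so both dicts use the same vocabulary.
    translated = {name_map.get(k, k): v for k, v in ctransformers.items()}
    keys = set(llama_cpp) | set(translated)
    return {
        k: (llama_cpp.get(k), translated.get(k))
        for k in sorted(keys)
        if llama_cpp.get(k) != translated.get(k)
    }


mismatches = diff_overrides(llama_cpp_overrides, ctransformers_config)
print(mismatches or "overrides match")
```

This only compares the values set explicitly; it says nothing about the defaults each runtime applies on its own.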
The model is derived from the original codellama-7b-instruct, using the conversion method suggested for llama.cpp.
The system and user prompts are the same for both runtimes, and the prompt template is taken from the Code Llama paper.
template = """<s>[INST] <<SYS>>
{system}
<</SYS>>
{user} [/INST]"""
system = """You are very helpful coding assistant who can write complete and correct programs in various programming languages, expecially in java and scala."""
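To rule out prompt differences between the two runs, it may help to render the template once and hand the identical string to both runtimes. A minimal sketch using plain `str.format`; the `user` text is a made-up placeholder and not the prompt actually used.

```python
template = """<s>[INST] <<SYS>>
{system}
<</SYS>>
{user} [/INST]"""

# System prompt copied verbatim from the post, so it matches what was sent.
system = """You are very helpful coding assistant who can write complete and correct programs in various programming languages, expecially in java and scala."""

# Placeholder user prompt for illustration only.
user = "Write a function that reverses a string."

# Render once, then pass this exact string to each runtime.
prompt = template.format(system=system, user=user)
print(repr(prompt))
```

Printing with `repr` makes newlines and stray whitespace visible, which makes it easier to compare what each runtime actually receives.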
The ctransformers-based completion is adequate, but the llama.cpp completion is noticeably worse: often incomplete, repetitive, and sometimes stuck in a repetition loop.
Apart from the overrides, as far as I can tell the defaults are the same for both implementations.
What else can I check to make llama.cpp behave the same way? I would prefer to use llama.cpp.
One situation in which a fixed seed can still produce different answers is when earlier prompts and answers are included in the context window. With LLMs like Mixtral-8x7B-Instruct-v0.1, it is enough to issue the same prompt 3 times: the context for prompt number 3 then contains the 3 identical prompts and the 2 identical answers, and the result is a slightly different answer. This happens despite the fixed seed, which otherwise works correctly (i.e. when the prompt is unaffected by the conversation history).
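The effect is easier to see when the prompt for each turn is written out. A minimal sketch, assuming the chat client simply concatenates earlier turns into the prompt; `build_prompt` and the placeholder strings are hypothetical, and no model is called.

```python
def build_prompt(history, question):
    """Concatenate earlier (question, answer) turns, then the new question."""
    parts = []
    for past_question, past_answer in history:
        parts.append(f"[INST] {past_question} [/INST] {past_answer}")
    parts.append(f"[INST] {question} [/INST]")
    return "\n".join(parts)


question = "Same question every time"
answer = "Same answer every time"  # stands in for the model output

# Turn 1: no history. Turn 3: two earlier identical exchanges in the context.
first = build_prompt([], question)
third = build_prompt([(question, answer), (question, answer)], question)

print(first == third)  # False
print(third.count(question), third.count(answer))  # 3 2
```

The seed fixes the sampler's random draws, but those draws are applied to token probabilities computed from a different input, so an identical answer is not guaranteed.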