Tags: python, artificial-intelligence, cpu, huggingface-transformers, ctransformers

How to run any GGUF model using transformers or any other library


With ctransformers, I can only use certain supported models. How can I run any model (e.g. Mistral, Zephyr) in the GGUF format on CPU?


Solution

  • llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a little more involved, so I'll refrain from posting those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to easily download models.

    Once you have both llama-cpp-python and huggingface_hub installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) like so:

    ## Imports
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama
    
    ## Download the GGUF model
    model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
    model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" # this is the specific model file we'll use in this example. It's a 4-bit quant, but other levels of quantization are available in the model repo if preferred
    model_path = hf_hub_download(model_name, filename=model_file)
    
    ## Instantiate model from downloaded file
    llm = Llama(
        model_path=model_path,
        n_ctx=16000,     # Context length to use
        n_threads=32,    # Number of CPU threads to use
        n_gpu_layers=0   # Number of model layers to offload to GPU
    )
    
    ## Generation kwargs
    generation_kwargs = {
        "max_tokens":20000,
        "stop":["</s>"],
        "echo":False, # Echo the prompt in the output
        "top_k":1 # This is essentially greedy decoding, since the model will always return the highest-probability token. Set this value > 1 for sampling decoding
    }
    
    ## Run inference
    prompt = "The meaning of life is "
    res = llm(prompt, **generation_kwargs) # res is a dictionary in the standard completion format
    
    ## Extract the generated text from the LLM response dictionary and print it
    print(res["choices"][0]["text"])
    

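    If you want chat-style prompting instead of raw text completion, llama-cpp-python also exposes a create_chat_completion method that applies the model's chat template for you (it tries to read the template from the GGUF metadata, and you can pass chat_format explicitly to Llama if needed). Here is a minimal sketch reusing the llm object from above; the messages and max_tokens value are just illustrative:

    ## Chat-style inference, reusing the `llm` object created above
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a GGUF file is in one sentence."}
    ]
    chat_res = llm.create_chat_completion(messages=messages, max_tokens=256)
    print(chat_res["choices"][0]["message"]["content"])
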
    Keep in mind that Mixtral is a fairly large model for most laptops and requires ~25+ GB of RAM, so if you need a smaller model, try one like llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf"). Swapping one in only means changing model_name and model_file, as sketched below.
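
    A minimal sketch with the Mistral OpenOrca option; the n_ctx and n_threads values are illustrative, so adjust them for your machine:

    ## Same download-and-load workflow with a smaller (~4 GB) quantized model
    model_name = "TheBloke/Mistral-7B-OpenOrca-GGUF"
    model_file = "mistral-7b-openorca.Q4_K_M.gguf"
    model_path = hf_hub_download(model_name, filename=model_file)
    llm = Llama(model_path=model_path, n_ctx=8192, n_threads=8, n_gpu_layers=0)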