
How to run any quantized GGUF model on CPU for local inference?


In the ctransformers library, I can only load around a dozen supported models. How can I run local inference on CPU (not just on GPU) with any open-source LLM quantized in the GGUF format (e.g. Llama 3, Mistral, or Zephyr, which ctransformers does not support)?


Solution

  • llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a little more involved, so I'll refrain from posting those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to easily download models.
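Before moving on, it can help to confirm that both packages are actually visible to the interpreter you plan to use. The snippet below is a minimal sketch using only the standard library; the helper name missing_packages is made up for this example and is not part of either library.

```python
import importlib.util


def missing_packages(modules=("llama_cpp", "huggingface_hub")):
    """Return the names of modules that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("llama_cpp and huggingface_hub are both importable.")
```

Note that the import names (llama_cpp, huggingface_hub) differ from the pip package name in the case of llama-cpp-python.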

    Once you have both llama-cpp-python and huggingface_hub installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) like so:

    ## Imports
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama
    
    ## Download the GGUF model
    model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
    model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" # this is the specific model file we'll use in this example. It's a 4-bit quant, but other levels of quantization are available in the model repo if preferred
    model_path = hf_hub_download(model_name, filename=model_file)
    
    ## Instantiate model from downloaded file
    llm = Llama(
        model_path=model_path,
        n_ctx=16000,      # Context length to use, in tokens
        n_threads=32,     # Number of CPU threads to use; adjust to your machine's core count
        n_gpu_layers=0    # Number of model layers to offload to GPU (0 = CPU only)
    )
    
    ## Generation kwargs
    generation_kwargs = {
        "max_tokens":2000, # Maximum number of tokens to generate; prompt tokens plus max_tokens must fit within n_ctx
        "stop":["</s>"],
        "echo":False, # Echo the prompt in the output
        "top_k":1 # This is essentially greedy decoding, since the model will always return the highest-probability token. Set this value > 1 for sampling decoding
    }
    
    ## Run inference
    prompt = "The meaning of life is "
    res = llm(prompt, **generation_kwargs) # Res is a dictionary
    
    ## Unpack the generated text from the LLM response dictionary and print it
    print(res["choices"][0]["text"])
    # res is short for result
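The response dictionary carries more than the generated text. As a rough sketch, assuming the response follows the OpenAI-style completion layout (a "choices" list plus a "usage" block), you can pull out the finish reason and token count as well. The helper summarize_completion and the hand-written example_res dictionary are illustrative stand-ins so the snippet runs without loading a model.

```python
def summarize_completion(res):
    """Extract the text, finish reason, and completion token count from a response dict."""
    choice = res["choices"][0]
    usage = res.get("usage", {})
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written stand-in for a real response (assumed layout, not actual model output)
example_res = {
    "choices": [
        {"text": " to find your own.", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 6, "completion_tokens": 5, "total_tokens": 11},
}

summary = summarize_completion(example_res)
print(summary["text"])
if summary["finish_reason"] == "length":
    # "length" means generation stopped because it hit the max_tokens limit
    print("Output was cut off; consider raising max_tokens.")
```

Checking the finish reason is a cheap way to tell whether the model stopped on its own or ran into the max_tokens limit.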
    

    Keep in mind that Mixtral is a fairly large model for most laptops and needs roughly 25 GB of RAM or more, so if you need a smaller model, try one like llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf").
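To avoid finding out about memory limits the hard way, a rough pre-flight check can compare the size of the downloaded file against the RAM you have. This is only a sketch: it assumes the model needs about its file size plus a fixed headroom (the 2 GB default is a guess, and the real overhead grows with context length), and the helper name model_fits_in_ram is made up for this example. The available RAM figure is supplied by you.

```python
import os
import tempfile


def model_fits_in_ram(model_path, available_ram_gb, headroom_gb=2.0):
    """Rough check: file size plus assumed headroom must not exceed available RAM."""
    size_gb = os.path.getsize(model_path) / 1024**3
    return size_gb + headroom_gb <= available_ram_gb


# Demo with a tiny placeholder file instead of a real GGUF download
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"\0" * 1024 * 1024)  # 1 MiB of dummy data
    demo_path = tmp.name

demo_ok = model_fits_in_ram(demo_path, available_ram_gb=16)
demo_too_small = model_fits_in_ram(demo_path, available_ram_gb=1)
os.remove(demo_path)

print(demo_ok, demo_too_small)
```

In practice you would pass the model_path returned by hf_hub_download instead of the placeholder file.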