huggingface-transformers, large-language-model, quantization

Speeding up load time of LLMs


I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM with 4-bit quantization as shown below. However, presumably due to the quantization, it is taking up to 10 minutes to load this model.

Is there a way to speed up this loading process?

  1. I see that there is the GGUF file format, which may help in this regard (although I am not sure why or how).
  2. Would torch.compile somehow help me load the model faster next time? My hypothesis is that, once compiled, I could save the resulting model in a binary format that loads faster.
  3. Should I bake the loaded model into the Docker image somehow to speed this up? The downside is that, due to CUDA, the Docker image is already at 4 GB.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda"  # the device to load the model onto
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Quantize the weights to 4-bit NF4 on the fly while loading, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

Solution

  • You haven't provided many details about your Docker setup. But yes, every time you run this container you will have to download the model files, unless you build your own image that copies the model files into it; then you can use the cache_dir parameter in from_pretrained to point to the location of your model (see the sketch below).

    I am able to load Llama 3 8B onto a Tesla M40 in a few seconds.
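
    A minimal sketch of that approach, assuming the model files are downloaded once at image-build time into a directory such as /opt/models (the path and the build step are illustrative, not part of the original answer):

import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
local_cache = "/opt/models"  # hypothetical path baked into the Docker image

# Step 1 (run once, e.g. in a RUN step of the Dockerfile or a build script):
# fetch the repository into the image so no network download happens at runtime.
snapshot_download(repo_id=model_id, cache_dir=local_cache)

# Step 2 (at container start): load from the baked-in cache instead of the Hub.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    cache_dir=local_cache,  # resolves to the files copied into the image
)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=local_cache)

    With the files already inside the image, from_pretrained resolves them from the local cache and skips the download; the 4-bit quantization step itself still runs at load time.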