I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4-bit quantization as shown below. However, presumably due to the quantization, it is taking up to 10 minutes to load this model.
Is there a way to speed up this loading process?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda"  # the device to load the model onto
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Quantize to 4-bit NF4 on the fly, with nested (double) quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
You haven't provided much detail about your Docker setup. But yes, every time you run the container it will have to download the model files again, unless you build your own image that copies the model files into the container; then you can use the cache_dir parameter of from_pretrained to point to the location of your model.
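As a minimal sketch, assuming the Hugging Face cache has been copied into the image under /models/hf_cache (a hypothetical path; use wherever you put the files):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# cache_dir points at the cache directory baked into the image, so
# from_pretrained reads local files instead of downloading them on every run.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
    cache_dir="/models/hf_cache",  # hypothetical path inside the container
)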
I am able to load Llama 3 8B onto a Tesla M40 in a few seconds.