I am trying to load a large Hugging Face model with code like the following:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model)
tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model)
generator = pipeline("text-generation", model=model_from_disc, tokenizer=tokenizer_from_disc)
The program crashes right after the first line because it runs out of memory. Is there a way to load the model in chunks so that the program doesn't crash?
EDIT
See cronoik's answer for accepted solution, but here are the relevant pages on Hugging Face's documentation:
Sharded Checkpoints: https://huggingface.co/docs/transformers/big_models#sharded-checkpoints:~:text=in%20the%20future.-,Sharded%20checkpoints,-Since%20version%204.18.0
Large Model Loading: https://huggingface.co/docs/transformers/main_classes/model#:~:text=the%20weights%20instead.-,Large%20model%20loading,-In%20Transformers%204.20.0
You could try to load it with low_cpu_mem_usage:
from transformers import AutoModelForCausalLM

model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model, low_cpu_mem_usage=True)
Please note that low_cpu_mem_usage requires Accelerate >= 0.9.0 and PyTorch >= 1.9.0.