
Problem setting up Llama-2 in Google Colab - Cell-run fails when loading checkpoint shards


I'm trying to use Llama 2 chat (via Hugging Face) with 7B parameters in Google Colab (Python 3.10.12). I've already obtained my access token from Meta. I'm simply using the code from Hugging Face on how to load the model, along with my access token. Here is my code:

!pip install transformers
 
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

token = "---Token copied from Hugging Face and pasted here---"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)

It starts downloading the model, but when it reaches "Loading checkpoint shards:" the cell just stops running, without any error:

(screenshot: the Colab cell stuck at "Loading checkpoint shards")


Solution

  • The issue is the Colab instance running out of RAM. Based on your comments, you are using the basic Colab instance with 12.7 GB of CPU RAM.

    For the Llama-2-7B model you'll need:

    • for the float32 model, about 25 GB (and you'll need roughly that much in both CPU RAM and GPU RAM);
    • for the bfloat16 model, around 13 GB (still too much for the basic Colab CPU instance, since you also need memory for the computation itself on top of the weights; a half-precision loading sketch for a GPU runtime follows the link below).

    Check this link for the details on the required resources: huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/3
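    If you do switch to a GPU runtime with enough memory (e.g. a T4 with ~15 GB), a minimal half-precision loading sketch could look like the following. This assumes accelerate is installed so that device_map="auto" is available:

    !pip install transformers accelerate

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    token = "---Token copied from Hugging Face and pasted here---"

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)

    # Half precision roughly halves the weight memory (~13 GB instead of ~25 GB);
    # device_map="auto" places the checkpoint shards directly on the GPU instead of
    # materializing the full float32 model in CPU RAM first.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
        device_map="auto",
        token=token,
    )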

    Also, if you only want to do inference (predictions) with the model, I would recommend using its quantized 4-bit or 8-bit versions. They can be run on CPU and don't need a lot of memory.
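    One common way to load a 4-bit version in Colab is through bitsandbytes. A minimal sketch, assuming a CUDA GPU runtime is available (the bitsandbytes backend itself does not run on CPU), could look like this; in 4 bit the 7B weights take roughly 4 GB:

    !pip install transformers accelerate bitsandbytes

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch

    token = "---Token copied from Hugging Face and pasted here---"

    # 4-bit NF4 quantization: weights are stored in 4 bits and dequantized to
    # float16 on the fly during the forward pass.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quant_config,
        device_map="auto",
        token=token,
    )

    # Quick check that generation works:
    inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))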