Tags: python, large-language-model, llama

Could not parse ModelProto from Meta-Llama-3.1-8B-Instruct/tokenizer.model


I tried to use Llama 3.1 without relying on external programs, but I was not successful. I downloaded the Meta-Llama-3.1-8B-Instruct model, which includes only the files consolidated.00.pth, params.json, and tokenizer.model.

The params.json file contains the following configuration:

{
  "dim": 4096,
  "n_layers": 32,
  "n_heads": 32,
  "n_kv_heads": 8,
  "vocab_size": 128256,
  "ffn_dim_multiplier": 1.3,
  "multiple_of": 1024,
  "norm_eps": 1e-05,
  "rope_theta": 500000.0,
  "use_scaled_rope": true
}

Can you guide me on how to use this model?

I have tried the following code:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, LlamaConfig

model_path = 'Meta-Llama-3.1-8B-Instruct'
tokenizer_path = f'{model_path}/tokenizer.model'

# Load tokenizer
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)

# Configure the model
model_config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    intermediate_size=5324.8,  # This value is calculated as 4096 * 1.3
    vocab_size=128256,
    use_scaled_rope=True
)

# Load the model
model = LlamaForCausalLM(config=model_config)
model.load_state_dict(torch.load(f'{model_path}/consolidated.00.pth'))

model.eval()

# Tokenize and generate output
input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors='pt')
outputs = model.generate(inputs['input_ids'])

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)

However, I got the following error:

(venv) PS C:\Users\Main\Desktop\mygguf> python app.py
C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\transformers\tokenization_utils_base.py:2165: FutureWarning: Calling LlamaTokenizer.from_pretrained() with the path to a single file or url is deprecated and won't be possible anymore in v5. Use a model identifier or the path to a directory instead.
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Traceback (most recent call last):
  File "C:\Users\Main\Desktop\mygguf\app.py", line 9, in <module>
    tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\transformers\tokenization_utils_base.py", line 2271, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\transformers\tokenization_utils_base.py", line 2505, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\transformers\models\llama\tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\transformers\models\llama\tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\sentencepiece\__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Main\Desktop\mygguf\venv\Lib\site-packages\sentencepiece\__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: could not parse ModelProto from Meta-Llama-3.1-8B-Instruct/tokenizer.model

Solution

  • The way to think about using an LLM is that you have to pass it information systematically.

    Since you are using a publicly available model, it already ships with its weights, configuration, and so on, so you don't need to declare your own.

    All you need to do is start by declaring the file paths of your model (i.e. where you downloaded it).

    There is also tokenization (tokens are simply the numeric pieces a model understands; the tokenizer maps the text you give it onto them). If the output is not what you want, you can try a different tokenizer; see the round-trip sketch after the snippet below.

    You can look up how to use different tokenizers, such as BERT, All-net, etc.; here is a link to a blog.

    You should also spend some time on the Hugging Face website; here is the link: hugging_face.

    Here is a snippet showing how to use the model; I have added comments on what each line does, and a note on building the configuration from params.json by hand follows it. I hope it helps!

    import torch
    from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM

    model_path = 'Meta-Llama-3.1-8B-Instruct'

    # Load the tokenizer directly from the model directory
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Load the model configuration from params.json
    config = LlamaConfig.from_json_file(f'{model_path}/params.json')

    # Build the model with that configuration
    model = LlamaForCausalLM(config=config)

    # Load the model weights
    state_dict = torch.load(f'{model_path}/consolidated.00.pth', map_location=torch.device('cpu'))
    model.load_state_dict(state_dict)

    model.eval()

    # Tokenize the prompt and generate a completion
    input_text = "Hello, how are you?"
    inputs = tokenizer(input_text, return_tensors='pt')
    outputs = model.generate(inputs['input_ids'])

    # Decode and print the generated text
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(output)
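
    If LlamaConfig.from_json_file does not map Meta's native key names (dim, n_layers, n_heads, ...) onto the Hugging Face field names, you can also build the configuration by hand. The sketch below is only an illustration under that assumption: it translates the params.json fields into LlamaConfig arguments and sizes the feed-forward layer the way Meta's reference llama code does (int(2 * (4 * dim) / 3), scaled by ffn_dim_multiplier, then rounded up to a multiple of multiple_of), which gives 14336 for this model rather than 4096 * 1.3. use_scaled_rope has no direct counterpart here and is left out of the sketch.

    import json
    from transformers import LlamaConfig

    model_path = 'Meta-Llama-3.1-8B-Instruct'

    with open(f'{model_path}/params.json') as f:
        params = json.load(f)

    # Feed-forward size: int(2 * (4 * dim) / 3), scaled by ffn_dim_multiplier,
    # then rounded up to a multiple of multiple_of (14336 for the 8B model).
    ffn_hidden = int(2 * (4 * params['dim']) / 3)
    ffn_hidden = int(params['ffn_dim_multiplier'] * ffn_hidden)
    ffn_hidden = params['multiple_of'] * ((ffn_hidden + params['multiple_of'] - 1) // params['multiple_of'])

    config = LlamaConfig(
        hidden_size=params['dim'],                 # 4096
        num_hidden_layers=params['n_layers'],      # 32
        num_attention_heads=params['n_heads'],     # 32
        num_key_value_heads=params['n_kv_heads'],  # 8
        intermediate_size=ffn_hidden,              # 14336
        vocab_size=params['vocab_size'],           # 128256
        rms_norm_eps=params['norm_eps'],           # 1e-05
        rope_theta=params['rope_theta'],           # 500000.0
    )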
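
    To see what the tokenizer is actually doing, here is a minimal round-trip sketch. It reuses the same AutoTokenizer call as the snippet above, so it only works if that call succeeds; the printed ids and sub-word pieces are simply whatever that tokenizer produces for the prompt.

    from transformers import AutoTokenizer

    model_path = 'Meta-Llama-3.1-8B-Instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    text = "Hello, how are you?"
    token_ids = tokenizer.encode(text)                    # text -> list of integer token ids
    pieces = tokenizer.convert_ids_to_tokens(token_ids)   # the sub-word pieces behind those ids
    print(token_ids)
    print(pieces)
    print(tokenizer.decode(token_ids))                    # ids -> text again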