I pre-trained a model with a pre-training script and got a pytorch_model.bin. When I try to load it with the following code, it raises a UnicodeDecodeError:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("/path/to/pytorch_model.bin") # Raise UnicodeDecodeError
The traceback is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/models/bert/tokenization_bert.py", line 218, in __init__
    self.vocab = load_vocab(vocab_file)
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/models/bert/tokenization_bert.py", line 121, in load_vocab
    tokens = reader.readlines()
  File "/opt/tljh/user/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
How can I resolve this issue?
Versions:
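from_pretrained() takes the path to a directory written by save_pretrained() (or a model ID), not the path to the .bin file. Given a single file, BertTokenizer treats it as the vocabulary file and tries to read it as UTF-8 text, which is why load_vocab fails on the binary weights.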
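The failure can be reproduced without transformers. The sketch below is only an illustration: read_as_vocab is a hypothetical stand-in for the library's vocabulary loader, and the bytes written to the file are made-up placeholder data, not real model weights.

```python
import tempfile
from pathlib import Path


def read_as_vocab(path):
    """Hypothetical stand-in: read a file as UTF-8 lines, like a text vocab."""
    with open(path, "r", encoding="utf-8") as reader:
        return reader.readlines()


with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "pytorch_model.bin"
    # Placeholder binary content; 0x80 cannot start a UTF-8 character.
    weights.write_bytes(b"\x80\x02placeholder binary data")
    try:
        read_as_vocab(weights)
        error = None
    except UnicodeDecodeError as exc:
        error = exc

# Show the exception type and the decoder's reason for rejecting the byte.
print(type(error).__name__, "-", error.reason)
```

Any binary file read this way is likely to hit a byte that is not valid UTF-8, so the error says nothing about whether the weights themselves are intact.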
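Save the model and the tokenizer into the same directory. model.save_pretrained() on its own writes the weights and config but not the vocabulary, so the tokenizer used during pre-training (called tokenizer here) has to be saved as well:
model.save_pretrained("my_model_directory")
tokenizer.save_pretrained("my_model_directory")
Then load the tokenizer from that directory:
BertTokenizer.from_pretrained("my_model_directory")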
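If the path comes from a config or a command line, a small guard can catch this mistake early. This is a minimal sketch using only the standard library; pretrained_dir is a hypothetical helper name, not part of transformers.

```python
from pathlib import Path


def pretrained_dir(path):
    """Return the directory to hand to from_pretrained().

    Hypothetical helper: if `path` points at an existing file such as
    pytorch_model.bin, use the directory that contains it instead.
    """
    p = Path(path)
    return str(p.parent if p.is_file() else p)


# Intended use (assumes transformers is installed and the directory
# also holds the tokenizer files, e.g. vocab.txt):
# tokenizer = BertTokenizer.from_pretrained(pretrained_dir(some_path))
```

The helper only corrects the path. Loading still fails if the directory lacks the tokenizer files, so they have to be saved or copied there first.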