Tags: python, tensorflow, pytorch, bert-language-model, huggingface-transformers

Huggingface Electra - Load model trained with google implementation error: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte


I have trained an ELECTRA model from scratch using Google's implementation:

python run_pretraining.py --data-dir gc://bucket-electra/dataset/ --model-name greek_electra --hparams hparams.json

with the following hyperparameters (hparams.json):

{
  "embedding_size": 768,
  "max_seq_length": 512,
  "train_batch_size": 128,
  "vocab_size": 100000,
  "model_size": "base",
  "num_train_steps": 1500000
}

After training the model, I used the convert_electra_original_tf_checkpoint_to_pytorch.py script from the transformers library to convert the checkpoint:

python convert_electra_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path output/models/transformer/greek_electra --config_file resources/hparams.json --pytorch_dump_path output/models/transformer/discriminator  --discriminator_or_generator "discriminator"
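(For context: from_pretrained expects a transformers-style config.json inside the model directory, which the conversion step does not necessarily write for you. A minimal sketch of producing one yourself, assuming the standard ElectraConfig / save_pretrained API and mapping Google's "max_seq_length" to transformers' "max_position_embeddings" — the dump path below is just the one from the command above:)

```python
from transformers import ElectraConfig

# Mirror the pretraining hparams; "base" model defaults
# (hidden_size=768, num_hidden_layers=12, ...) are filled in by ElectraConfig.
config = ElectraConfig(
    embedding_size=768,
    max_position_embeddings=512,  # Google's "max_seq_length"
    vocab_size=100000,
)

# Write config.json next to the converted weights so that
# from_pretrained() can discover the configuration automatically.
config.save_pretrained("output/models/transformer/discriminator")
```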

Now I am trying to load the model:

from transformers import ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('discriminator')

but I get the following error:

Traceback (most recent call last):
  File "~/.local/lib/python3.9/site-packages/transformers/configuration_utils.py", line 427, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "~/.local/lib/python3.9/site-packages/transformers/configuration_utils.py", line 510, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Any ideas what's causing this & how to solve it?


Solution

  • It seems that @npit is right: the output of convert_electra_original_tf_checkpoint_to_pytorch.py does not include the configuration I supplied (hparams.json). The loader therefore has no valid config.json to parse, which produces the UnicodeDecodeError. I created an ElectraConfig object with the same parameters and passed it to the from_pretrained function, which solved the issue.