I am working with the bert-base-multilingual-uncased model, but when I try to set the TOKENIZER in my config, it throws an OSError:
import transformers

class config:
    DEVICE = "cuda:0"
    MAX_LEN = 256
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 1
    BERT_PATH = {"bert-base-multilingual-uncased": "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"}
    MODEL_PATH = "workspace/data/jigsaw-multilingual/model.bin"
    TOKENIZER = transformers.BertTokenizer.from_pretrained(
        BERT_PATH["bert-base-multilingual-uncased"],
        do_lower_case=True)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-33-83880b6b788e> in <module>
----> 1 class config:
2 # def __init__(self):
3
4 DEVICE = "cuda:0"
5 MAX_LEN = 256
<ipython-input-33-83880b6b788e> in config()
11 TOKENIZER = transformers.BertTokenizer.from_pretrained(
12 BERT_PATH["bert-base-multilingual-uncased"],
---> 13 do_lower_case=True)
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
1138
1139 """
-> 1140 return cls._from_pretrained(*inputs, **kwargs)
1141
1142 @classmethod
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1244 ", ".join(s3_models),
1245 pretrained_model_name_or_path,
-> 1246 list(cls.vocab_files_names.values()),
1247 )
1248 )
OSError: Model name 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was not
found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc,
bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1,
wietsedv/bert-base-dutch-cased).
We assumed 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such
vocabulary files at this path or url.
As I interpret the error, it says that the vocab.txt file was not found at the given location, but it is actually present.
These are the files available in the bert-base-multilingual-uncased folder:
config.json
pytorch_model.bin
vocab.txt
I am new to working with BERT, so I am not sure whether there is a different way to define the tokenizer.
I think this should work:
from transformers import BertTokenizer
TOKENIZER = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
This downloads the tokenizer directly from the Hugging Face model hub instead of reading it from disk.
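If you want to keep using your local copy instead of downloading, note that "workspace/data/..." is a relative path, so from_pretrained resolves it against the current working directory; if your notebook runs from a different directory, the lookup fails even though the files exist. A minimal sketch of a pre-flight check (resolve_model_dir is a hypothetical helper, not part of transformers) that turns the path into an absolute one and verifies vocab.txt is actually reachable before handing it to the tokenizer:

```python
import os

def resolve_model_dir(path, required=("vocab.txt",)):
    """Resolve a local model directory to an absolute path and verify
    the files the tokenizer needs are really there, so a bad working
    directory fails loudly instead of raising a confusing OSError."""
    abs_path = os.path.abspath(path)
    missing = [f for f in required
               if not os.path.isfile(os.path.join(abs_path, f))]
    if missing:
        raise FileNotFoundError(f"{abs_path} is missing {missing}")
    return abs_path

# Usage (assumes transformers is installed and the directory is valid):
# TOKENIZER = transformers.BertTokenizer.from_pretrained(
#     resolve_model_dir(
#         "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"),
#     do_lower_case=True)
```

If the check raises, fix the path (or launch the notebook from the project root) rather than changing the config.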