Tags: python, nlp, pytorch, bert-language-model, huggingface-transformers

Python: BERT Tokenizer cannot be loaded


I am working with the bert-base-multilingual-uncased model, but when I try to set the TOKENIZER in the config it throws an OSError.

Model Config

import transformers

class config:
    DEVICE = "cuda:0"
    MAX_LEN = 256
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 1

    BERT_PATH = {"bert-base-multilingual-uncased": "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"}
    MODEL_PATH = "workspace/data/jigsaw-multilingual/model.bin"

    TOKENIZER = transformers.BertTokenizer.from_pretrained(
            BERT_PATH["bert-base-multilingual-uncased"], 
            do_lower_case=True)

Error

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-33-83880b6b788e> in <module>
    ----> 1 class config:
          2 #     def __init__(self):
          3 
          4         DEVICE = "cuda:0"
          5         MAX_LEN = 256
    
    <ipython-input-33-83880b6b788e> in config()
         11         TOKENIZER = transformers.BertTokenizer.from_pretrained(
         12             BERT_PATH["bert-base-multilingual-uncased"],
    ---> 13             do_lower_case=True)
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
       1138 
       1139         """
    -> 1140         return cls._from_pretrained(*inputs, **kwargs)
       1141 
       1142     @classmethod
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1244                     ", ".join(s3_models),
       1245                     pretrained_model_name_or_path,
    -> 1246                     list(cls.vocab_files_names.values()),
       1247                 )
       1248             )
    
    OSError: Model name 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was not
    found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased,
    bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese,
    bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
    bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad,
    bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased,
    TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1,
    wietsedv/bert-base-dutch-cased).

    We assumed 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was a path,
    a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but
    couldn't find such vocabulary files at this path or url.
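A side note on reading the traceback: the failure is reported at the `class config:` line because statements in a class body execute immediately when the class is defined, not when it is instantiated. A minimal sketch of the same pattern (the `load_resource` function is hypothetical, standing in for the `from_pretrained` call):

```python
def load_resource():
    # Stand-in for transformers.BertTokenizer.from_pretrained(...)
    raise OSError("resource not found")

error = None
try:
    # The class body runs top to bottom *now*, at definition time,
    # so the OSError surfaces before the class even exists.
    class Config:
        RESOURCE = load_resource()
except OSError as err:
    error = err
```

This is why the OSError fires as soon as the notebook cell defining `config` runs.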

As I interpret the error, it says that the vocab.txt file was not found at the given location, but it is actually present.

Following are the files available in the bert-base-multilingual-uncased folder:

  • config.json
  • pytorch_model.bin
  • vocab.txt

I am new to working with BERT, so I am not sure whether there is a different way to define the tokenizer.


Solution

  • I think this should work:

    from transformers import BertTokenizer
    TOKENIZER = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
    

    This downloads the tokenizer from the Hugging Face hub instead of reading it from disk.
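  • If you want to keep loading from the local folder instead, note that `from_pretrained` resolves a relative path against the current working directory, so `workspace/data/...` only works if the notebook is running directly above `workspace`. A small sanity check along these lines can confirm the path before handing it over (the `resolve_tokenizer_dir` helper is hypothetical, not part of transformers):

```python
import os

def resolve_tokenizer_dir(path):
    """Return the absolute path of a local tokenizer directory.

    BertTokenizer.from_pretrained looks for vocab.txt in the given
    directory, resolving a relative path against the current working
    directory. Failing early here gives a clearer message than the
    generic OSError raised by transformers.
    """
    abs_path = os.path.abspath(path)
    if not os.path.isfile(os.path.join(abs_path, "vocab.txt")):
        raise FileNotFoundError(
            f"no vocab.txt under {abs_path!r}; check the working "
            "directory or pass an absolute path"
        )
    return abs_path
```

    With the directory confirmed, `BertTokenizer.from_pretrained(resolve_tokenizer_dir(path), do_lower_case=True)` should load without touching the network.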