python pytorch huggingface-transformers

How do I check if a tokenizer/model is already saved?


I am using HuggingFace Transformers with PyTorch. My modus operandi is to download a pre-trained model and save it in a local project folder.

While doing so, I can see that a .bin file is saved locally, which holds the model. However, I am also downloading and saving a tokenizer, and I cannot see any file associated with it.

So, how do I check whether a tokenizer is saved locally before downloading it? Secondly, apart from the usual os.path.isfile(...) check, is there a better way to prefer a local copy at a given location over downloading?


Solution

  • I've used this code in the past for this purpose. You can adapt it to your setting.

    import os
    import urllib.request
    
    from tokenizers import BertWordPieceTokenizer  # used in the later step mentioned below
    from transformers import AutoTokenizer
    
    def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=False):
        """Return {resource: local path}; download each vocab file unless it already exists."""
        # Maps each resource to {model name: download URL}. This attribute may not
        # be present in newer transformers releases.
        vocab_files_map = tokenizer.pretrained_vocab_files_map
        vocab_files = {}
        os.makedirs(output_path, exist_ok=True)  # urlretrieve fails if the folder is missing
        for resource in vocab_files_map.keys():
            download_location = vocab_files_map[resource][model_type]
            f_path = os.path.join(output_path, os.path.basename(download_location))
            if not vocab_exist_bool:
                urllib.request.urlretrieve(download_location, f_path)
            vocab_files[resource] = f_path
        return vocab_files
    
    model_type = 'bert-base-uncased'
    # initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_type)
    # will do this part later
    
    # retrieve the vocab file if it's not there
    output_path = os.path.join(os.getcwd(), 'vocab_files')
    vocab_file_name = 'bert-base-uncased-vocab.txt'
    vocab_exist_bool = os.path.exists(os.path.join(output_path, vocab_file_name))
    
    # get the vocab files
    vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=vocab_exist_bool)
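
To answer the second part of the question more directly, here is a minimal sketch of a "local copy first" loader. It assumes the tokenizer was (or will be) written with `save_pretrained`, which stores files such as `tokenizer_config.json` in the target folder. The helper names `tokenizer_saved_locally` and `load_tokenizer` and the `TOKENIZER_FILES` list are hypothetical, and the exact set of files varies by tokenizer class and library version.

```python
import os

# Assumed list of files that save_pretrained() commonly writes for a tokenizer;
# adjust it for the tokenizer class you actually use.
TOKENIZER_FILES = ("tokenizer_config.json", "tokenizer.json", "vocab.txt")


def tokenizer_saved_locally(local_dir, candidates=TOKENIZER_FILES):
    """Return True if local_dir contains at least one known tokenizer file."""
    return os.path.isdir(local_dir) and any(
        os.path.isfile(os.path.join(local_dir, name)) for name in candidates
    )


def load_tokenizer(model_type, local_dir):
    """Load from local_dir when possible; otherwise download and save there."""
    # Imported lazily so the file check above works without transformers installed.
    from transformers import AutoTokenizer

    if tokenizer_saved_locally(local_dir):
        return AutoTokenizer.from_pretrained(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_type)
    tokenizer.save_pretrained(local_dir)
    return tokenizer
```

Calling `load_tokenizer('bert-base-uncased', './tokenizer')` would download on the first run and read from `./tokenizer` afterwards. If you want a hard guarantee that nothing is fetched, `from_pretrained` also accepts `local_files_only=True`, which raises an error instead of contacting the hub.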