How can I push a custom tokenizer to HuggingFace Hub?


I have a custom Tokenizer built and trained using the HuggingFace Tokenizers library. I can save the custom tokenizer to a JSON file and load it back without a problem.

Here is the simplified code:

from tokenizers import Tokenizer, models

model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
# train from a dataset held in memory
tokenizer.train_from_iterator(get_training_corpus())
# save to a file
tokenizer.save('my-tokenizer.json')

Here is how I load the custom tokenizer:

tokenizer = Tokenizer.from_file('my-tokenizer.json')

The problem is: can I push my custom tokenizer to the HuggingFace Hub? There is no push_to_hub() method in the Tokenizer class.

I know that if I retrain from a pre-trained tokenizer with the code below, I can save the new tokenizer and push it to the HuggingFace Hub:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("a-pretrained-model")
# vocab_size is required; 30000 here is just an example value
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=30000)
# save the tokenizer to the specified folder, along with its config files
tokenizer.save_pretrained("my-new-shiny-tokenizer")
# push the tokenizer to the HuggingFace Hub
tokenizer.push_to_hub("my-new-shiny-tokenizer-in-hf")

But I cannot use this approach, as my tokenizer requires a custom decoder, normalizer and pre-tokenizer.
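For context, here is a rough sketch of what such a customized tokenizer looks like (the normalizer, pre-tokenizer, and decoder below are just placeholders; my real components are custom):

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# illustrative components -- swap in your own custom ones
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece(prefix="##")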


Solution

  • You are almost there! Your current tokenizer is an instance of the Tokenizer class from the tokenizers library. To get push_to_hub() support, you first have to wrap it in a tokenizer class from the transformers library. For example:

    # Wrap your own tokenizer
    from transformers import PreTrainedTokenizerFast
    
    wrapped_tokenizer = PreTrainedTokenizerFast(
        tokenizer_file="my-tokenizer.json", # You can load from the tokenizer file
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )
    
    # Finally, save your own pretrained tokenizer
    wrapped_tokenizer.save_pretrained('my-tokenizer')
    

    Code is taken from this tutorial: https://huggingface.co/learn/nlp-course/chapter6/8#building-a-bpe-tokenizer-from-scratch
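
    Alternatively, if you still have the live Tokenizer object in memory, PreTrainedTokenizerFast can also wrap it directly via the tokenizer_object argument (a minimal sketch, assuming the same special tokens as above):

    from tokenizers import Tokenizer
    from transformers import PreTrainedTokenizerFast

    tokenizer = Tokenizer.from_file("my-tokenizer.json")
    wrapped_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,  # pass the in-memory Tokenizer directly
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )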

    However, since you asked a more general question, here are the remaining steps you need to complete in order to push it. I'm assuming below that you're working in a notebook.

    In your HuggingFace account,

    1. create a 'New Model'; let's call it foo.
    2. (if you haven't done so yet) go to Settings -> Access Tokens, create a new token, call it 'notebooks', and give it the 'write' role.
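
    If you'd rather create the model repo from code than through the web UI, the huggingface_hub library also provides create_repo (a short sketch; it assumes you are already authenticated with a 'write' token, which the login step below takes care of):

    from huggingface_hub import create_repo

    # creates https://huggingface.co/<your-username>/foo
    create_repo("foo")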

    In your notebook, enter this:

    from huggingface_hub import login
    login()
    

    and copy-paste your 'write' token.
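
    If you're running a plain script instead of a notebook, you can skip the interactive prompt and pass the token directly (a sketch that assumes you've stored the token in an HF_TOKEN environment variable):

    import os
    from huggingface_hub import login

    login(token=os.environ["HF_TOKEN"])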

    Finally, in your notebook, you can push the wrapped tokenizer to the 'foo' repo using: wrapped_tokenizer.push_to_hub(repo_id='foo')
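
    To verify the push worked, you can load the tokenizer back from the Hub (replace 'your-username' with your actual account name):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("your-username/foo")
    print(tokenizer("Hello world!"))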