How can I push a custom tokenizer to HuggingFace Hub?


I have a custom Tokenizer built and trained using the HuggingFace Tokenizers library. I can save the custom tokenizer to a JSON file and load it back without a problem.

Here is the simplified code:

from tokenizers import Tokenizer, models

model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
# train from a dataset held in memory
tokenizer.train_from_iterator(get_training_corpus())
# save to a file
tokenizer.save('my-tokenizer.json')

Here is how I load the custom tokenizer:

tokenizer = Tokenizer.from_file('my-tokenizer.json')

The problem is: can I push my custom tokenizer to the HuggingFace Hub? There is no push_to_hub() method in the Tokenizer class.

I know that if I retrain from a pre-trained tokenizer with the code below, I can save the new tokenizer and push it to the HuggingFace Hub:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("a-pretrained-model")
# vocab_size is required; 30000 here is just an example value
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=30000)
# save the tokenizer to the specified folder, along with its config files
tokenizer.save_pretrained("my-new-shiny-tokenizer")
# push the tokenizer to the HuggingFace Hub
tokenizer.push_to_hub("my-new-shiny-tokenizer-in-hf")

But I cannot use this approach, as my tokenizer requires a custom decoder, normalizer and pre-tokenizer.
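For context, here is a rough sketch of what such a customized tokenizer looks like (the normalizer, pre-tokenizer, and decoder below are just placeholders; my real components are custom):

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# illustrative components -- swap in your own custom ones
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece(prefix="##")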


Solution

  • You are almost there! Your current tokenizer is an instance of the Tokenizer class from the tokenizers library. To get push_to_hub() support, you first have to wrap it in a tokenizer class from the transformers library. For example:

    # Wrap your own tokenizer
    from transformers import PreTrainedTokenizerFast
    
    wrapped_tokenizer = PreTrainedTokenizerFast(
        tokenizer_file="my-tokenizer.json", # You can load from the tokenizer file
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )
    
    # Finally, save your own pretrained tokenizer
    wrapped_tokenizer.save_pretrained('my-tokenizer')
    

    Code is taken from this tutorial: https://huggingface.co/learn/nlp-course/chapter6/8#building-a-bpe-tokenizer-from-scratch
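
    Alternatively, if you still have the live Tokenizer object in memory, PreTrainedTokenizerFast can also wrap it directly via the tokenizer_object argument (a minimal sketch, assuming the same special tokens as above):

    from tokenizers import Tokenizer
    from transformers import PreTrainedTokenizerFast

    tokenizer = Tokenizer.from_file("my-tokenizer.json")
    wrapped_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,  # pass the in-memory Tokenizer directly
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )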

    However, since you asked a more general question, here are the remaining steps you need to complete in order to push it. I'm assuming below that you're working in a notebook.

    In your HuggingFace account,

    1. create a 'New Model'; let's call it foo.
    2. (if you haven't done so yet) go to Settings -> Access Tokens, create a new token, call it 'notebooks', and give it the 'write' role.
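
    If you'd rather create the model repo from code than through the web UI, the huggingface_hub library also provides create_repo (a short sketch; it assumes you are already authenticated with a 'write' token, which the login step below takes care of):

    from huggingface_hub import create_repo

    # creates https://huggingface.co/<your-username>/foo
    create_repo("foo")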

    In your notebook, enter this:

    from huggingface_hub import login
    login()
    

    and copy-paste your 'write' token.
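
    If you're running a plain script instead of a notebook, you can skip the interactive prompt and pass the token directly (a sketch that assumes you've stored the token in an HF_TOKEN environment variable):

    import os
    from huggingface_hub import login

    login(token=os.environ["HF_TOKEN"])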

    Finally, in your notebook, you can push the wrapped tokenizer to the 'foo' repo using: wrapped_tokenizer.push_to_hub(repo_id='foo')
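
    To verify the push worked, you can load the tokenizer back from the Hub (replace 'your-username' with your actual account name):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("your-username/foo")
    print(tokenizer("Hello world!"))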