huggingface-transformers · tokenize

How to know which tokens are unk tokens from a Hugging Face tokenizer?


I want to add some new tokens to the tokenizer of a pre-trained model to fine-tune it on my downstream task. But I don't want to check each sample by hand to find out which tokens are not in the tokenizer's vocabulary. Is there any way to extract which string tokens become unknown tokens when I pass in a corpus?


Solution

  • I'm not sure whether you can reliably/efficiently determine whether a token is unknown without passing it through the tokeniser, particularly because many contemporary tokenisers use sub-word tokenisation.

    However, you can drastically reduce the processing time by running the tokeniser only on the list of unique words. Note that by "words" here I'm actually referring to "traditional" non-sub-word tokens.

    Extracting a set of unique words

    To do this, you can get the list of words using the pre-tokenizer:

    >>> tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I'm good, thanks!")
    [('I', (0, 1)), ("'", (1, 2)), ('m', (2, 3)), ('good', (4, 8)), (',', (8, 9)), ('thanks', (10, 16)), ('!', (16, 17))]
    
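    For completeness, here is a minimal sketch of where tokenizer comes from; bert-base-uncased is just an assumed example checkpoint, and any fast tokenizer that exposes backend_tokenizer will do. Note that pre_tokenize_str returns (token, offsets) pairs, which is why the offsets are discarded below.

    from transformers import AutoTokenizer

    # Assumed example checkpoint; substitute the tokenizer of your own pre-trained model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # pre_tokenize_str returns (token, (start, end)) pairs, so keep only the strings.
    words = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I'm good, thanks!")
    print([word for word, _ in words])  # ['I', "'", 'm', 'good', ',', 'thanks', '!']
    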

    You can of course opt not to use the pre_tokenizer and just split on whitespace, but this will greatly increase the number of unique words, particularly because punctuation marks are not space-separated. This will also depend on the language you are working with.
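
    For example, on the same sample string, plain whitespace splitting keeps the punctuation attached:

    >>> "I'm good, thanks!".split()
    ["I'm", 'good,', 'thanks!']
    

    so good, and good would be counted as two different unique words.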

    In addition, depending on your data and tokeniser, it might be useful to normalise the text before pre-tokenising. For example, if your model is uncased, it would be beneficial to lower-case all tokens, further reducing the number of unique words.
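
    A minimal sketch of that, assuming a fast tokenizer whose backend exposes a normaliser (as bert-base-uncased does; some tokenisers have none):

    >>> tokenizer.backend_tokenizer.normalizer.normalize_str("Héllo, How are you?")
    'hello, how are you?'
    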

    You might find this guide useful, as it goes into further detail on the pre-processing steps that the tokeniser performs.

    Running the tokeniser on the unique tokens

    Add these pre-tokenised tokens to a set:

    unique_tokens = set()
    for text in corpus:  # corpus: an iterable of raw text strings
        tokens = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
        # pre_tokenize_str yields (token, offsets) pairs; keep only the token strings
        unique_tokens.update([token for token, _ in tokens])
    

    Then, run your tokeniser on unique_tokens, extracting the tokens which are unknown to the tokeniser:

    unique_tokens = list(unique_tokens)
    unknown_tokens = []
    # Tokenise each unique word; if the UNK id appears among its sub-token ids,
    # the word cannot be fully represented by the vocabulary.
    for i, sub_tokens in enumerate(tokenizer(unique_tokens)["input_ids"]):
        if tokenizer.unk_token_id in sub_tokens:
            unknown_tokens.append(unique_tokens[i])
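
    Once you have unknown_tokens, you can add them to the tokenizer before fine-tuning, which is the goal in the question. A sketch of that final step, where model is assumed to be the transformers model you will fine-tune:

    # Register the previously unknown words as new tokens and resize the
    # model's embedding matrix so the new ids have embedding rows.
    num_added = tokenizer.add_tokens(unknown_tokens)
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))
    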