I want to build a multi-class classification model that takes conversational data as input to a BERT model (bert-base-uncased). The conversations look like this:
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?
Apart from this, I have two more inputs.
I was wondering whether I should put special tokens into the conversation to make it more meaningful to the BERT model, like this:
[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]
But I am not able to add [EOT] as a new special token; the tokenizer splits it into word pieces, as the steps below show.
Or should I use the [SEP] token for this instead?
EDIT: steps to reproduce
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids) # --> [100, 102, 0, 101, 103]
num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer)) # --> Embedding(30523, 768)
print(tokenizer.convert_tokens_to_ids('[EOT]')) # --> 30522
text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''
enc = tokenizer.encode_plus(
    text_to_encode,
    max_length=128,
    truncation=True,
    add_special_tokens=True,
    return_token_type_ids=False,
    return_attention_mask=False,
)['input_ids']
print(tokenizer.convert_ids_to_tokens(enc))
Result:
['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it', 'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query', ':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']', 'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything', 'else', '?', '[SEP]']
Since the intention of the [SEP] token is to act as a separator between two sentences, it fits your objective of using [SEP] to separate sequences of QUERY and ANSWER, as in the sketch below.
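For example, here is a minimal sketch (assuming the same bert-base-uncased setup as in your question) that joins the turns with the tokenizer's own sep_token, so no vocabulary change is needed:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

turns = [
    "QUERY: I want to ask a question.",
    "ANSWER: Sure, ask away.",
    "QUERY: How is the weather today?",
    "ANSWER: It is nice and sunny.",
]

# sep_token is '[SEP]' for BERT; it is recognized in raw text, so nothing new has to be learned
text = f" {tokenizer.sep_token} ".join(turns)
enc = tokenizer(text, add_special_tokens=True)['input_ids']
print(tokenizer.convert_ids_to_tokens(enc))
# --> ['[CLS]', 'query', ':', ..., '[SEP]', 'answer', ':', ..., '[SEP]']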
You can also try adding dedicated marker tokens: <BOQ> and <EOQ> to mark the beginning and end of a QUERY, and likewise <BOA> and <EOA> to mark the beginning and end of an ANSWER.
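A sketch of that idea, again assuming the setup from your question (the <BOQ>/<EOQ>/<BOA>/<EOA> names are just the suggestion above, not a standard):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# register all four markers as special tokens so the uncased tokenizer neither lowercases nor splits them
tokenizer.add_tokens(['<BOQ>', '<EOQ>', '<BOA>', '<EOA>'], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

text = "<BOQ> How is the weather today? <EOQ> <BOA> It is nice and sunny. <EOA>"
print(tokenizer.tokenize(text))
# --> ['<BOQ>', 'how', 'is', 'the', 'weather', 'today', '?', '<EOQ>', '<BOA>', 'it', 'is', 'nice', 'and', 'sunny', '.', '<EOA>']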
Sometimes, using the existing tokens works much better than adding new tokens to the vocabulary, since learning a new token embedding requires a huge number of training iterations as well as enough data; see the check after the snippet below.
However, if your application demands a new token, it can be added as follows. The special_tokens=True flag is the key change: without it, the uncased tokenizer normalizes (lowercases) the input before matching added tokens, so '[EOT]' in the text never matches the vocabulary entry and gets split into '[', 'e', '##ot', ']' as in your output above:
num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True)  # this line is updated
model.resize_token_embeddings(len(tokenizer))

# the tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
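As a quick sanity check (same assumptions as above), [EOT] should now survive tokenization as a single token; keep in mind that the embedding row created for it by resize_token_embeddings starts out randomly initialized and only becomes meaningful through fine-tuning:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(['[EOT]'], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize('QUERY: How is the weather today? [EOT]'))
# --> ['query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[EOT]']

# the freshly added embedding row exists but is untrained until you fine-tune
eot_id = tokenizer.convert_tokens_to_ids('[EOT]')
print(model.get_input_embeddings().weight[eot_id].shape)  # --> torch.Size([768])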