I would like to encode a sentence with BOS and EOS tokens. When I load a pretrained tokenizer, there is no BOS token, so I added one to the tokenizer. After that, I encoded the sentence:
model_checkpoint = "facebook/wmt19-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_special_tokens({'bos_token' : '<s>'})
tokenizer.encode("Resumption of the session", add_special_tokens = True)
result: [2642, 4584, 636, 9, 6, 9485, 2] # 2642 is not the BOS token, and 2 is the EOS token.
However, the result shows that the BOS token does not appear in the encoded sentence. How can I include the BOS token when encoding?
Even though you've specified bos_token to be a string of your choosing, you still need to set the add_bos_token property of the tokenizer to True to get it to prepend the bos_token to its output. You can do this when you instantiate AutoTokenizer:
tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    bos_token='<s>',
    add_bos_token=True,
)
tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))  # decode just the first id to confirm it's BOS
Output:
[50257, 4965, 24098, 286, 262, 6246]
<s>
... or you could call the add_special_tokens method and set the add_bos_token property after instantiation:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_special_tokens({'bos_token': '<s>'})
tokenizer.add_bos_token = True
tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))  # decode just the first id to confirm it's BOS
Output:
[50257, 4965, 24098, 286, 262, 6246]
<s>
Note that the tokenizer you select may have a default bos_token, in which case you can simply pass add_bos_token=True without specifying bos_token='<s>' (unless, of course, you want to customise the bos_token).
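The distinction the answer turns on can be illustrated without transformers at all: registering a special token only adds it to the vocabulary, while a separate flag controls whether encode() actually emits it. Here is a toy sketch of that two-step behaviour (the class and its vocabulary are entirely hypothetical, not the real tokenizer internals):

```python
class ToyTokenizer:
    """Minimal stand-in showing why registering a BOS token and
    prepending it during encode() are two separate steps."""

    def __init__(self):
        self.vocab = {"Resumption": 0, "of": 1, "the": 2, "session": 3, "</s>": 4}
        self.bos_token = None
        self.bos_token_id = None
        self.add_bos_token = False  # off by default, as in the question

    def add_special_tokens(self, mapping):
        # Registering only extends the vocabulary...
        tok = mapping["bos_token"]
        self.bos_token = tok
        self.bos_token_id = len(self.vocab)
        self.vocab[tok] = self.bos_token_id

    def encode(self, text):
        ids = [self.vocab[w] for w in text.split()] + [self.vocab["</s>"]]
        # ...but only this flag makes encode() prepend the BOS id.
        if self.add_bos_token and self.bos_token_id is not None:
            ids = [self.bos_token_id] + ids
        return ids


tok = ToyTokenizer()
tok.add_special_tokens({"bos_token": "<s>"})
print(tok.encode("Resumption of the session"))  # [0, 1, 2, 3, 4] -- no BOS yet
tok.add_bos_token = True
print(tok.encode("Resumption of the session"))  # [5, 0, 1, 2, 3, 4] -- BOS id first
```

This mirrors what happens above: add_special_tokens alone gave the token an id but left the encoding unchanged, and only setting add_bos_token made it appear at the front.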