Tokenizer didn't add BOS token when encoding the sentence

I would like to encode a sentence with BOS and EOS tokens. The pretrained tokenizer I loaded has no BOS token, so I added one to the tokenizer and then encoded the sentence.

from transformers import AutoTokenizer

model_checkpoint = "facebook/wmt19-en-de"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_special_tokens({'bos_token': '<s>'})

tokenizer.encode("Resumption of the session", add_special_tokens=True)

Result: [2642, 4584, 636, 9, 6, 9485, 2]  # 2642 is not the BOS token; 2 is the EOS token.

However, the result shows that the BOS token does not appear in the encoded sentence. How can I include the BOS token when encoding?

Solution

Even though you've set bos_token to a string of your choosing, you still need to set the tokenizer's add_bos_token property to True before the tokenizer will prepend the bos_token to its output.
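To make the distinction concrete, here is a toy sketch of the idea. ToyTokenizer is a hypothetical stand-in I made up for illustration, not the transformers implementation; it only shows that registering a BOS token and actually prepending it are two separate steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    vocab: Dict[str, int] = field(default_factory=lambda: {"hello": 0, "world": 1})
    bos_token: Optional[str] = None
    add_bos_token: bool = False

    def add_special_tokens(self, special: Dict[str, str]) -> None:
        # Registering the token only gives it a name and an id in the vocab.
        self.bos_token = special["bos_token"]
        self.vocab.setdefault(self.bos_token, len(self.vocab))

    def encode(self, text: str) -> List[int]:
        ids = [self.vocab[word] for word in text.split()]
        # The BOS id is prepended only when the flag is switched on.
        if self.add_bos_token and self.bos_token is not None:
            ids = [self.vocab[self.bos_token]] + ids
        return ids


toy = ToyTokenizer()
toy.add_special_tokens({"bos_token": "<s>"})
without_flag = toy.encode("hello world")  # BOS registered, flag still off

toy.add_bos_token = True
with_flag = toy.encode("hello world")  # flag on, BOS id now leads the list

print(without_flag)
print(with_flag)
```

With the real tokenizer, the corresponding switch is the add_bos_token property.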

You can do this when you instantiate AutoTokenizer:

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    bos_token = '<s>',
    add_bos_token = True
)

tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))

Output:

[50257, 4965, 24098, 286, 262, 6246]
<s>

... or you could call the add_special_tokens method and set the add_bos_token property after instantiation:

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint
)

tokenizer.add_special_tokens({'bos_token' : '<s>'})
tokenizer.add_bos_token = True

tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))

Output:

[50257, 4965, 24098, 286, 262, 6246]
<s>

Note that the tokenizer you select may already have a default bos_token, in which case you can simply pass add_bos_token = True without specifying bos_token = '<s>' (unless you want to customise the bos_token, of course).
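The ids printed above don't match the ones in the question, so add_bos_token may not behave the same way for every tokenizer class. If yours appears to ignore it, a fallback you could try is prepending the id yourself. The sketch below is only that: prepend_bos is a hypothetical helper name, and the BOS id of 0 is a placeholder for illustration. In practice you would pass tokenizer.bos_token_id.

```python
from typing import List, Sequence


def prepend_bos(token_ids: Sequence[int], bos_token_id: int) -> List[int]:
    """Return token_ids with bos_token_id at the front, without duplicating it."""
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids)
    return [bos_token_id] + list(token_ids)


# Ids copied from the question; 0 is a placeholder BOS id, not a verified value.
ids = [2642, 4584, 636, 9, 6, 9485, 2]
with_bos = prepend_bos(ids, 0)
print(with_bos)
```

The helper checks the first id before prepending, so calling it twice on the same sequence does not add a second BOS.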