Tokenizer didn't add BOS token when encoding the sentence


I would like to encode a sentence with BOS and EOS tokens. The pretrained tokenizer I load has no BOS token, so I added one to the tokenizer and then encoded the sentence.

from transformers import AutoTokenizer

model_checkpoint = "facebook/wmt19-en-de"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_special_tokens({'bos_token': '<s>'})

tokenizer.encode("Resumption of the session", add_special_tokens=True)

result: [2642, 4584, 636, 9, 6, 9485, 2] # 2642 is not the BOS token, and 2 is the EOS token.

However, the BOS token does not appear in the encoded sentence. How can I include the BOS token when encoding?


Solution

  • Even though you've specified bos_token to be a string of your choosing, you still need to set the add_bos_token property of the tokenizer to True (on tokenizer classes that support it) so that the tokenizer prepends bos_token to its output.

    You can do this when you instantiate AutoTokenizer:

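    from transformers import AutoTokenizer

    # model_checkpoint: checkpoint name string, defined beforehand
    tokenizer = AutoTokenizer.from_pretrained(
        model_checkpoint,
        bos_token='<s>',
        add_bos_token=True
    )

    tok_enc = tokenizer.encode("Resumption of the session")
    print(tok_enc)
    print(tokenizer.decode(tok_enc[0]))  # decode only the first id to check that it is the BOS token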
    

    Output:

    [50257, 4965, 24098, 286, 262, 6246]
    <s>
    

    Alternatively, call the add_special_tokens method and set the add_bos_token property after instantiation:

    tokenizer = AutoTokenizer.from_pretrained(
        model_checkpoint
    )
    
    tokenizer.add_special_tokens({'bos_token' : '<s>'})
    tokenizer.add_bos_token = True
    
    tok_enc = tokenizer.encode("Resumption of the session")
    print(tok_enc)
    print(tokenizer.decode(tok_enc[0]))
    

    Output:

    [50257, 4965, 24098, 286, 262, 6246]
    <s>
    

    Note that the tokenizer you select may already have a default bos_token, in which case you can simply pass add_bos_token = True without specifying bos_token = '<s>' (unless you want to customise the bos_token, of course).
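
If setting add_bos_token appears to have no effect for a particular tokenizer class (whether a given class honours that flag is not something the answer above verifies), one fallback is to prepend the BOS id yourself after encoding. The following is a minimal sketch of that idea, not the library's own mechanism: prepend_bos is a hypothetical helper name, and the BOS id of 0 is a placeholder chosen only for illustration. The other ids are the ones shown in the question.

```python
def prepend_bos(token_ids, bos_token_id):
    """Return a new list with bos_token_id in front, unless it is already first."""
    if bos_token_id is None:
        # A tokenizer without a BOS token reports None for its BOS id.
        raise ValueError("no BOS token id available; define a bos_token first")
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids)
    return [bos_token_id] + list(token_ids)


# Ids copied from the question's output; the BOS id below is a placeholder.
question_ids = [2642, 4584, 636, 9, 6, 9485, 2]
hypothetical_bos_id = 0

with_bos = prepend_bos(question_ids, hypothetical_bos_id)
print(with_bos)
```

With a real tokenizer, the call would look like prepend_bos(tokenizer.encode(text), tokenizer.bos_token_id), assuming a bos_token has been set so that bos_token_id is not None. The helper leaves the input untouched when the BOS id is already in first position, so it is safe to apply even if the tokenizer does add the token on its own.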