I tried the following tokenization example:
from transformers import BertTokenizer

MODEL_TYPE = "bert-base-uncased"  # assumed checkpoint; substitute your own model name
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = ["I hate this. Not that."]  # a batch containing one sentence
_tokenized = tokenizer(sent, padding=True, max_length=20, truncation=True)
print(tokenizer.decode(_tokenized['input_ids'][0]))
print(len(_tokenized['input_ids'][0]))
The output was:
[CLS] i hate this. not that. [SEP]
9
Notice the max_length=20 argument passed to the tokenizer. How can I make the BERT tokenizer append 11 [PAD] tokens to this sentence so that its total length is 20?
With padding=True (equivalent to padding="longest"), sequences are only padded to the length of the longest sequence in the batch, so a batch with a single nine-token sentence gets no padding at all. Set padding="max_length" instead:
_tokenized = tokenizer(sent, padding="max_length", max_length=20, truncation=True)
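With this change, and assuming the same tokenizer as above, decoding shows the eleven appended [PAD] tokens and the sequence length comes out to 20:

print(tokenizer.decode(_tokenized['input_ids'][0]))
# [CLS] i hate this. not that. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
print(len(_tokenized['input_ids'][0]))
# 20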