I have a sentence and a pre-trained tokenizer, and I want to count the number of tokens in the sentence, excluding special tokens. I am using this code from HuggingFace transformers:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
text = "I want to know the number of tokens in this sentence!!!"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
How can I do it?
You can either call the tokenizer (or its encode method) with add_special_tokens=False, or use the tokenize method, which never adds special tokens:
encoded_input = tokenizer(text, return_tensors='tf', add_special_tokens=False)
encoded_input.input_ids.shape[1]  # input_ids has shape (batch_size, sequence_length), so this is the token count
or, using tokenize directly:
tokenized_input = tokenizer.tokenize(text)  # list of subword strings, no special tokens
len(tokenized_input)
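For completeness, here is a minimal self-contained sketch, using the same bert-base-cased checkpoint and sentence from your question, that counts the tokens both ways and checks that they agree. Note that loading the model is not needed just to count tokens:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
text = "I want to know the number of tokens in this sentence!!!"

# Option 1: encode without special tokens and count the ids
ids = tokenizer.encode(text, add_special_tokens=False)

# Option 2: tokenize into subword strings and count them
tokens = tokenizer.tokenize(text)

# Both approaches give the same count
assert len(ids) == len(tokens)
print(len(ids))

By default BERT's tokenizer adds two special tokens to a single sentence, [CLS] at the start and [SEP] at the end, so with add_special_tokens left at its default of True the count would be higher by 2.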