python nlp token huggingface-transformers bert-language-model

How to get the number of tokens in a sentence in Keras


I have a sentence and a pre-trained tokenizer. I want to count the tokens in the sentence, excluding special tokens. I am using the following code from HuggingFace:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
text = "I want to know the number of tokens in this sentence!!!"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

How can I do it?


Solution

  • You can either call the tokenizer with add_special_tokens=False and read the length of input_ids, or use the tokenize method and take the length of the list it returns.

    # Encode without special tokens; input_ids has shape (1, number_of_tokens)
    encoded_input = tokenizer(text, return_tensors='tf', add_special_tokens=False)
    encoded_input.input_ids.shape[1]
    

    or

    # tokenize() returns the token strings only, with no special tokens added
    tokenized_input = tokenizer.tokenize(text)
    len(tokenized_input)
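
If only the count is needed, the first approach can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name count_tokens is hypothetical, and it assumes the tokenizer is a HuggingFace-style callable that accepts add_special_tokens and returns a mapping with an "input_ids" entry. Leaving out return_tensors makes input_ids a plain Python list, so no TensorFlow tensor has to be built just to count tokens.

```python
def count_tokens(tokenizer, text):
    """Return the number of tokens in `text`, excluding special tokens.

    `count_tokens` is a hypothetical helper name. It assumes `tokenizer`
    can be called as tokenizer(text, add_special_tokens=False) and that
    the result exposes "input_ids" for a single sentence.
    """
    # Without return_tensors, input_ids is a flat list of token ids
    encoding = tokenizer(text, add_special_tokens=False)
    return len(encoding["input_ids"])
```

With the tokenizer and text from the question, count_tokens(tokenizer, text) should give the same number as len(tokenizer.tokenize(text)).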