I've fine-tuned a Hugging Face BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this:
from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')
I want to obtain results on very long texts, and since I know about the 512-token maximum for both training and inference, I split my texts into smaller chunks before passing them to the ner_pipeline.
But how do I split the text without tokenizing it myself to check the length of each chunk? I want the chunks to be as long as possible, but at the same time I don't want to exceed the 512-token maximum and risk that no predictions are computed on what's left of the sentence.
Is there a way to know whether the texts I'm feeding in exceed the 512-token maximum?
Finding out whether a text exceeds 512 tokens is simply a matter of checking its tokenized output. For this purpose, you can use the AutoTokenizer class from Hugging Face transformers. For example,
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Sentence to check whether it exceeds 512 tokens"
tokenized_sentence = tokenizer.tokenize(sentence)
print(len(sentence.split()))     # number of whitespace-separated words
print(len(tokenized_sentence))   # number of tokens after tokenization
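Note that tokenizer.tokenize() does not add the [CLS] and [SEP] special tokens that the model attaches to every input, so the practical budget for your own text is 510 tokens. Calling the tokenizer directly gives a count that includes the special tokens (continuing the example above):
encoded = tokenizer(sentence)
print(len(encoded["input_ids"]))   # token count including [CLS] and [SEP]
print(tokenizer.model_max_length)  # 512 for bert-base-uncased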
You can give it a try on long documents and observe that at some point the tokenized length exceeds 512 tokens. This may not be a problem for text classification, but you may lose your entity labels in the token classification task. Thus, before feeding long documents to your Transformer-based network, you should preprocess your texts with AutoTokenizer, find the points where the tokenized text reaches the maximum model input size (e.g., 512 tokens), cut the text at that point, and create a new sample from the remaining part of the long document, as sketched below.
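Here is a minimal sketch of that preprocessing step. The split_into_chunks helper, the 510-token budget (leaving room for [CLS] and [SEP]), and the model path are my own assumptions, not something from the question:
from transformers import AutoTokenizer, pipeline

def split_into_chunks(text, tokenizer, max_tokens=510):
    """Greedily pack whitespace-separated words into chunks whose
    tokenized length stays within max_tokens (assumes no single
    word tokenizes to more than max_tokens pieces)."""
    chunks, current_words, current_len = [], [], 0
    for word in text.split():
        word_len = len(tokenizer.tokenize(word))
        if current_words and current_len + word_len > max_tokens:
            chunks.append(" ".join(current_words))
            current_words, current_len = [], 0
        current_words.append(word)
        current_len += word_len
    if current_words:
        chunks.append(" ".join(current_words))
    return chunks

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ner_pipeline = pipeline("token-classification",
                        model="path/to/model_folder",      # placeholder path
                        tokenizer="path/to/model_folder")

long_text = "..."  # your long document
results = []
for chunk in split_into_chunks(long_text, tokenizer):
    results.extend(ner_pipeline(chunk, aggregation_strategy="simple"))
Cutting at word boundaries like this avoids splitting a word mid-token, but an entity can still end up spread across two chunks; if that matters for your data, cutting at sentence boundaries instead is a safer variant of the same idea.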