My objective is to annotate long documents with bioformer-8L. I was told to use stride and truncation so that I don't have to split my documents into 512-token chunks myself.
In the training phase, I called the tokenizer like this:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
Then I train my model; at this stage I don't see any parameter that could help with my task.
With my trained model I do this for the predictions:
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")
But it does not work: the model stops providing annotations in the middle of the text.
You can't just move __call__ parameters like stride to from_pretrained; they are silently ignored there:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
t = AutoTokenizer.from_pretrained(model_id, stride=3, return_overflowing_tokens=True, model_max_length=10, truncation=True, is_split_into_words=True)
sample = "test "*200
# sliding window will not be applied
print(len(t(sample).input_ids))
# sliding window will be applied
print(len(t(sample, max_length=10, truncation=True, stride=3, return_overflowing_tokens=True).input_ids))
Output:
202 # a single sequence of 202 ids -- the stride passed to from_pretrained was ignored
40  # 40 overlapping chunks of at most 10 ids each
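To see where the 40 comes from: with a BERT-style tokenizer, each chunk reserves 2 positions for [CLS] and [SEP], so max_length=10 leaves 8 content tokens per chunk, and stride=3 means each window overlaps the previous one by 3 tokens, i.e. it advances by 8 - 3 = 5. A minimal pure-Python sketch of that overflow logic (no transformers needed; the helper name is mine):

```python
def sliding_window_chunks(tokens, max_length=10, stride=3, num_special=2):
    """Mimic the tokenizer's overflow logic: each chunk holds
    max_length - num_special content tokens and overlaps the
    previous chunk by `stride` tokens."""
    window = max_length - num_special   # content tokens per chunk
    step = window - stride              # how far the window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                       # last window reached the end
        start += step
    return chunks

tokens = ["test"] * 200                 # "test "*200 tokenizes to 200 word tokens
print(len(sliding_window_chunks(tokens)))  # 40, matching the output above
```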
With the pipeline, you can pass stride as an __init__ parameter:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
ner = pipeline("token-classification", model_id, stride=128, aggregation_strategy="first")
sample = "Hi my name is cronoik and I live in Germany "*3000
o = ner(sample)
print(len(o))
print(o[0:5])
Output:
3000
[{'entity_group': 'LOC', 'score': 0.9997917, 'word': 'Germany', 'start': 36, 'end': 43},
{'entity_group': 'LOC', 'score': 0.9998311, 'word': 'Germany', 'start': 80, 'end': 87},
{'entity_group': 'LOC', 'score': 0.9997998, 'word': 'Germany', 'start': 124, 'end': 131},
{'entity_group': 'LOC', 'score': 0.9997831, 'word': 'Germany', 'start': 168, 'end': 175},
{'entity_group': 'LOC', 'score': 0.99981374, 'word': 'Germany', 'start': 212, 'end': 219}]
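As a quick sanity check (independent of the model), the start/end offsets in the output index into the original input string, so each predicted span can be recovered directly even though the pipeline chunked the text internally:

```python
sample = "Hi my name is cronoik and I live in Germany " * 3000

# The repeated unit is 44 characters long, so consecutive entity offsets
# differ by 44 (36 -> 80 -> 124, as in the output above).
for start, end in [(36, 43), (80, 87), (124, 131)]:
    print(sample[start:end])  # prints 'Germany' each time
```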