
Pretrained model with stride doesn’t predict long text


My objective is to annotate long documents with bioformer-8L. I was told to use stride and truncation so that I don't have to split my documents into chunks of 512 tokens.

In the training phase, I called the tokenizer like this:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)

Then I train my model, and at this stage I don't see any parameter that could help me with my task.

With my trained model I do this for the predictions:

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")

But it does not work: the model stops providing annotations in the middle of the text.


Solution

  • You can't just move the __call__ parameters like stride to from_pretrained; they are not applied when the tokenizer is called and must be passed at call time instead:

    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
    
    model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
    t = AutoTokenizer.from_pretrained(model_id, stride=3, return_overflowing_tokens=True, model_max_length=10, truncation=True, is_split_into_words=True)
    
    sample = "test "*200
    
    # sliding window will not be applied
    print(len(t(sample).input_ids))
    
    # sliding window will be applied
    print(len(t(sample, max_length=10, truncation=True, stride=3, return_overflowing_tokens=True).input_ids))
    

    Output:

    202 # one sequence with 202 token ids (no sliding window)
    40  # 40 overlapping chunks
    
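    For intuition, the sliding window that return_overflowing_tokens enables can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation; it assumes each chunk reserves 2 slots for special tokens ([CLS]/[SEP]), which is why 200 content tokens with max_length=10 and stride=3 yield the 40 chunks seen above:

```python
def sliding_window_chunks(token_ids, max_length, stride, num_special=2):
    """Split token_ids into overlapping chunks: each chunk holds
    max_length - num_special content tokens, and consecutive chunks
    overlap by `stride` tokens."""
    window = max_length - num_special   # content tokens per chunk
    step = window - stride              # how far the window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks

ids = list(range(200))                  # 200 content tokens, like "test " * 200
chunks = sliding_window_chunks(ids, max_length=10, stride=3)
print(len(chunks))                      # 40, matching the 40 chunks above
```

    Note the last 3 tokens of each chunk reappear as the first 3 tokens of the next one; that overlap is what lets predictions near a chunk boundary see context from both sides.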

    With the pipeline, you can pass the value for stride as an __init__ parameter:

    from transformers import pipeline
    
    model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
    
    ner = pipeline("token-classification", model_id, stride=128, aggregation_strategy="first")
    
    sample = "Hi my name is cronoik and I live in Germany "*3000
    
    o = ner(sample)
    print(len(o))
    print(o[0:5])
    

    Output:

    3000
    [{'entity_group': 'LOC', 'score': 0.9997917, 'word': 'Germany', 'start': 36, 'end': 43}, 
     {'entity_group': 'LOC', 'score': 0.9998311, 'word': 'Germany', 'start': 80, 'end': 87},
     {'entity_group': 'LOC', 'score': 0.9997998, 'word': 'Germany', 'start': 124, 'end': 131},
     {'entity_group': 'LOC', 'score': 0.9997831, 'word': 'Germany', 'start': 168, 'end': 175},
     {'entity_group': 'LOC', 'score': 0.99981374, 'word': 'Germany', 'start': 212, 'end': 219}]
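
    Because consecutive chunks overlap, entities inside the overlap region are predicted more than once, and the pipeline resolves those duplicates for you. A simplified sketch of that idea, deduplicating entity spans by character offset (an illustration of the principle, not the pipeline's actual merging code):

```python
def merge_overlapping_entities(entities):
    """Deduplicate entity spans predicted on overlapping chunks,
    keeping the higher-scoring prediction for each (start, end) span."""
    best = {}
    for ent in entities:
        key = (ent["start"], ent["end"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return sorted(best.values(), key=lambda e: e["start"])

preds = [
    {"entity_group": "LOC", "score": 0.99, "word": "Germany", "start": 36, "end": 43},
    {"entity_group": "LOC", "score": 0.97, "word": "Germany", "start": 36, "end": 43},  # same span, seen again in the next chunk
    {"entity_group": "LOC", "score": 0.99, "word": "Germany", "start": 80, "end": 87},
]
print(len(merge_overlapping_entities(preds)))  # 2
```

    This is why the pipeline output above contains exactly one entity per occurrence of "Germany", even though many of them fall inside two chunks.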