
Pretrained model with stride doesn’t predict long text


My objective is to annotate long documents with bioformer-8L. I was told to use stride and truncation so that I don't have to split my documents into chunks of 512 tokens.

In the training phase, I called the tokenizer like this:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)

Then I train my model, and at this stage I don't see any parameter that could help me with my task.

With my trained model I do this for the predictions:

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, stride = 128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")

But it does not work: the model stops providing annotations in the middle of the text.


Solution

  • You can't just move the __call__ parameters like stride to from_pretrained; they are not applied when the tokenizer is called and must be passed at call time instead:

    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
    
    model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
    t = AutoTokenizer.from_pretrained(model_id, stride=3, return_overflowing_tokens=True, model_max_length=10, truncation=True, is_split_into_words=True)
    
    sample = "test "*200
    
    # sliding window will not be applied
    print(len(t(sample).input_ids))
    
    # sliding window will be applied
    print(len(t(sample, max_length=10, truncation=True, stride=3, return_overflowing_tokens=True).input_ids))
    

    Output:

    202 # one sequence with 202 token ids (no sliding window)
    40  # 40 overlapping chunks
    
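    For intuition, the sliding window that return_overflowing_tokens enables can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation; it assumes each chunk reserves 2 slots for special tokens ([CLS]/[SEP]), which is why 200 content tokens with max_length=10 and stride=3 yield the 40 chunks seen above:

```python
def sliding_window_chunks(token_ids, max_length, stride, num_special=2):
    """Split token_ids into overlapping chunks: each chunk holds
    max_length - num_special content tokens, and consecutive chunks
    overlap by `stride` tokens."""
    window = max_length - num_special   # content tokens per chunk
    step = window - stride              # how far the window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks

ids = list(range(200))                  # 200 content tokens, like "test " * 200
chunks = sliding_window_chunks(ids, max_length=10, stride=3)
print(len(chunks))                      # 40, matching the 40 chunks above
```

    Note the last 3 tokens of each chunk reappear as the first 3 tokens of the next one; that overlap is what lets predictions near a chunk boundary see context from both sides.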

    With the pipeline, you can pass the value for stride as an __init__ parameter:

    from transformers import pipeline
    
    model_id = 'Davlan/distilbert-base-multilingual-cased-ner-hrl'
    
    ner = pipeline("token-classification", model_id, stride=128, aggregation_strategy="first")
    
    sample = "Hi my name is cronoik and I live in Germany "*3000
    
    o = ner(sample)
    print(len(o))
    print(o[0:5])
    

    Output:

    3000
    [{'entity_group': 'LOC', 'score': 0.9997917, 'word': 'Germany', 'start': 36, 'end': 43}, 
     {'entity_group': 'LOC', 'score': 0.9998311, 'word': 'Germany', 'start': 80, 'end': 87},
     {'entity_group': 'LOC', 'score': 0.9997998, 'word': 'Germany', 'start': 124, 'end': 131},
     {'entity_group': 'LOC', 'score': 0.9997831, 'word': 'Germany', 'start': 168, 'end': 175},
     {'entity_group': 'LOC', 'score': 0.99981374, 'word': 'Germany', 'start': 212, 'end': 219}]
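
    Because consecutive chunks overlap, entities inside the overlap region are predicted more than once, and the pipeline resolves those duplicates for you. A simplified sketch of that idea, deduplicating entity spans by character offset (an illustration of the principle, not the pipeline's actual merging code):

```python
def merge_overlapping_entities(entities):
    """Deduplicate entity spans predicted on overlapping chunks,
    keeping the higher-scoring prediction for each (start, end) span."""
    best = {}
    for ent in entities:
        key = (ent["start"], ent["end"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return sorted(best.values(), key=lambda e: e["start"])

preds = [
    {"entity_group": "LOC", "score": 0.99, "word": "Germany", "start": 36, "end": 43},
    {"entity_group": "LOC", "score": 0.97, "word": "Germany", "start": 36, "end": 43},  # same span, seen again in the next chunk
    {"entity_group": "LOC", "score": 0.99, "word": "Germany", "start": 80, "end": 87},
]
print(len(merge_overlapping_entities(preds)))  # 2
```

    This is why the pipeline output above contains exactly one entity per occurrence of "Germany", even though many of them fall inside two chunks.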