python · huggingface-transformers · named-entity-recognition · huggingface-tokenizers · huggingface

How to pass arguments to HuggingFace TokenClassificationPipeline's tokenizer


I've fine-tuned a Hugging Face BERT model for Named Entity Recognition, and everything works as it should. I've now set up a token-classification pipeline to predict entities from the text I provide. This works fine as well.

I know that BERT models are supposed to be fed sentences of fewer than 512 tokens. Since I have texts longer than that, I split the sentences into shorter chunks and store the chunks in a list chunked_sentences. In brief, my tokenizer for training looks like this:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Pad to the longest sequence in the batch and leave truncation off (the
# default), so over-length inputs produce a warning instead of silent cutting.
tokenized_inputs = tokenizer(chunked_sentences, is_split_into_words=True, padding='longest')

I pad everything to the longest sequence and avoid truncation, so that if a tokenized sentence goes beyond 512 tokens I get a warning telling me I won't be able to train on it. That way I know I have to split those sentences into smaller chunks.
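
For completeness, the chunking itself can be as simple as slicing the pre-split word lists. A minimal sketch (the chunk_words helper and its 200-word cap are illustrative, not my actual code):

def chunk_words(words, max_words=200):
    # Crude word-count cap: a single word can expand into several subword
    # tokens, so the cap should stay well below the 512-token limit.
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

# words_per_sentence: a list of word lists, one per sentence
chunked_sentences = [chunk for words in words_per_sentence for chunk in chunk_words(words)]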

During inference I wanted to achieve the same thing, but I haven't found a way to pass arguments to the pipeline's tokenizer. The code looks like this:

from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')

I'm pretty sure that if a sentence is tokenized and exceeds 512 tokens, the extra tokens will be silently truncated and I'll get no warning. I want to avoid this.
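
For what it's worth, the silent truncation can at least be detected by tokenizing the text once yourself before calling the pipeline. A minimal sketch, assuming the tokenizer is loaded separately (e.g. with AutoTokenizer.from_pretrained(model_folder)); the printed warning is mine, not something transformers emits:

n_tokens = len(tokenizer(text)['input_ids'])
if n_tokens > tokenizer.model_max_length:  # 512 for bert-base-uncased
    print(f'{n_tokens} tokens exceed the {tokenizer.model_max_length}-token limit; the pipeline would truncate.')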

I tried passing arguments to the tokenizer like this:

tokenizer_kwargs = {'padding': 'longest'}
out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)

I got that idea from this answer, but it doesn't seem to work, since I get the following error:

Traceback (most recent call last):
  File "...\inference.py", line 42, in <module>
    out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)
  File "...\venv\lib\site-packages\transformers\pipelines\token_classification.py", line 191, in __call__
    return super().__call__(inputs, **kwargs)
  File "...\venv\lib\site-packages\transformers\pipelines\base.py", line 1027, in __call__
    preprocess_params, forward_params, postprocess_params = self._sanitize_parameters(**kwargs)
TypeError: TokenClassificationPipeline._sanitize_parameters() got an unexpected keyword argument 'padding'

Process finished with exit code 1

Any ideas? Thanks.


Solution

  • I took a closer look at https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/pipelines/token_classification.py#L86. The pipeline's _sanitize_parameters() only accepts a fixed set of keyword arguments, which is why an arbitrary tokenizer kwarg like padding is rejected. It seems you can instead override preprocess() to disable truncation and pad to the longest sequence:

    from transformers import AutoModelForTokenClassification, AutoTokenizer, TokenClassificationPipeline

    class MyTokenClassificationPipeline(TokenClassificationPipeline):
        def preprocess(self, sentence, offset_mapping=None):
            # Mirror the training-time settings: never truncate, pad to
            # the longest sequence in the batch.
            model_inputs = self.tokenizer(
                sentence,
                return_tensors=self.framework,
                truncation=False,
                padding='longest',
                return_special_tokens_mask=True,
                return_offsets_mapping=self.tokenizer.is_fast,
            )
            if offset_mapping:
                model_inputs["offset_mapping"] = offset_mapping

            model_inputs["sentence"] = sentence
            return model_inputs

    # Instantiating the pipeline class directly requires loaded objects,
    # not a path string (only the pipeline() factory accepts paths).
    model = AutoModelForTokenClassification.from_pretrained(model_folder)
    tokenizer = AutoTokenizer.from_pretrained(model_folder)

    ner_pipeline = MyTokenClassificationPipeline(model=model, tokenizer=tokenizer)
    out = ner_pipeline(text, aggregation_strategy='simple')
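
    As an aside, with truncation disabled an input that tokenizes past 512 tokens will now raise an error inside the model (BERT's position embeddings stop at 512) instead of being silently cut, which is the loud failure you were after.

    If I'm reading the factory code right, you can alternatively keep using pipeline() and hand it the custom class through its pipeline_class argument, so the model and tokenizer are still loaded from the folder for you (a sketch, reusing MyTokenClassificationPipeline from above):

    from transformers import pipeline

    ner_pipeline = pipeline(
        'token-classification',
        model=model_folder,
        tokenizer=model_folder,
        pipeline_class=MyTokenClassificationPipeline,
    )
    out = ner_pipeline(text, aggregation_strategy='simple')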