python · huggingface-transformers · named-entity-recognition · huggingface-tokenizers · huggingface

How to pass arguments to HuggingFace TokenClassificationPipeline's tokenizer


I've fine-tuned a Hugging Face BERT model for Named Entity Recognition, and everything works as it should. I've now set up a token-classification pipeline to predict entities from the text I provide. This works fine as well.

I know that BERT models are supposed to be fed sentences of fewer than 512 tokens. Since I have texts longer than that, I split the sentences into shorter chunks and store the chunks in a list chunked_sentences. In brief, my tokenizer for training looks like this:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Pad to the longest sequence in the batch and leave truncation off (the
# default), so over-length inputs produce a warning instead of silent cutting.
tokenized_inputs = tokenizer(chunked_sentences, is_split_into_words=True, padding='longest')

I pad everything to the longest sequence and avoid truncation, so that if a tokenized sentence goes beyond 512 tokens I get a warning telling me I won't be able to train on it. That way I know I have to split those sentences into smaller chunks.
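
For completeness, the chunking itself can be as simple as slicing the pre-split word lists. A minimal sketch (the chunk_words helper and its 200-word cap are illustrative, not my actual code):

def chunk_words(words, max_words=200):
    # Crude word-count cap: a single word can expand into several subword
    # tokens, so the cap should stay well below the 512-token limit.
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

# words_per_sentence: a list of word lists, one per sentence
chunked_sentences = [chunk for words in words_per_sentence for chunk in chunk_words(words)]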

During inference I wanted to achieve the same thing, but I haven't found a way to pass arguments to the pipeline's tokenizer. The code looks like this:

from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')

I'm pretty sure that if a sentence is tokenized and exceeds 512 tokens, the extra tokens will be silently truncated and I'll get no warning. I want to avoid this.
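
For what it's worth, the silent truncation can at least be detected by tokenizing the text once yourself before calling the pipeline. A minimal sketch, assuming the tokenizer is loaded separately (e.g. with AutoTokenizer.from_pretrained(model_folder)); the printed warning is mine, not something transformers emits:

n_tokens = len(tokenizer(text)['input_ids'])
if n_tokens > tokenizer.model_max_length:  # 512 for bert-base-uncased
    print(f'{n_tokens} tokens exceed the {tokenizer.model_max_length}-token limit; the pipeline would truncate.')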

I tried passing arguments to the tokenizer like this:

tokenizer_kwargs = {'padding': 'longest'}
out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)

I got that idea from this answer, but it doesn't seem to work, since I get the following error:

Traceback (most recent call last):
  File "...\inference.py", line 42, in <module>
    out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)
  File "...\venv\lib\site-packages\transformers\pipelines\token_classification.py", line 191, in __call__
    return super().__call__(inputs, **kwargs)
  File "...\venv\lib\site-packages\transformers\pipelines\base.py", line 1027, in __call__
    preprocess_params, forward_params, postprocess_params = self._sanitize_parameters(**kwargs)
TypeError: TokenClassificationPipeline._sanitize_parameters() got an unexpected keyword argument 'padding'

Process finished with exit code 1

Any ideas? Thanks.


Solution

  • I took a closer look at https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/pipelines/token_classification.py#L86. The pipeline's _sanitize_parameters() only accepts a fixed set of keyword arguments, which is why an arbitrary tokenizer kwarg like padding is rejected. It seems you can instead override preprocess() to disable truncation and pad to the longest sequence:

    from transformers import AutoModelForTokenClassification, AutoTokenizer, TokenClassificationPipeline

    class MyTokenClassificationPipeline(TokenClassificationPipeline):
        def preprocess(self, sentence, offset_mapping=None):
            # Mirror the training-time settings: never truncate, pad to
            # the longest sequence in the batch.
            model_inputs = self.tokenizer(
                sentence,
                return_tensors=self.framework,
                truncation=False,
                padding='longest',
                return_special_tokens_mask=True,
                return_offsets_mapping=self.tokenizer.is_fast,
            )
            if offset_mapping:
                model_inputs["offset_mapping"] = offset_mapping

            model_inputs["sentence"] = sentence
            return model_inputs

    # Instantiating the pipeline class directly requires loaded objects,
    # not a path string (only the pipeline() factory accepts paths).
    model = AutoModelForTokenClassification.from_pretrained(model_folder)
    tokenizer = AutoTokenizer.from_pretrained(model_folder)

    ner_pipeline = MyTokenClassificationPipeline(model=model, tokenizer=tokenizer)
    out = ner_pipeline(text, aggregation_strategy='simple')
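
    As an aside, with truncation disabled an input that tokenizes past 512 tokens will now raise an error inside the model (BERT's position embeddings stop at 512) instead of being silently cut, which is the loud failure you were after.

    If I'm reading the factory code right, you can alternatively keep using pipeline() and hand it the custom class through its pipeline_class argument, so the model and tokenizer are still loaded from the folder for you (a sketch, reusing MyTokenClassificationPipeline from above):

    from transformers import pipeline

    ner_pipeline = pipeline(
        'token-classification',
        model=model_folder,
        tokenizer=model_folder,
        pipeline_class=MyTokenClassificationPipeline,
    )
    out = ner_pipeline(text, aggregation_strategy='simple')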