Tags: deep-learning, huggingface-transformers, huggingface, huggingface-tokenizers

How to stop at 512 tokens when sending text to a pipeline? (HuggingFace Transformers)


I want to test my model using the Transformers pipeline. My model is a pretrained BERT, which works great as long as the given text is under 512 tokens. However, when sending a larger text to the pipeline, it fails because the input is too long. I tried to search, but couldn't figure out how to solve this issue.

This is my code:

from transformers import pipeline

def get_predicted_folder(text, model):
    pipe = pipeline("text-classification", model=model)
    if text:
        predicted_folder = pipe(text)
        label = predicted_folder[0]['label']
        score = predicted_folder[0]['score']
        return label, score
    else:
        err = "Error: The provided text is empty."
        return err, None

my_saved_model = "model/danish_bert_model"  # it is saved locally
label, score = get_predicted_folder(text, my_saved_model)

It gives me this error:

    RuntimeError: The size of tensor a (1593) must match the size of tensor b (512) at non-singleton dimension 1
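
To confirm it really is a length problem, you can count the tokens yourself. This quick check (using the model's own tokenizer on the same text variable as above) shows the input far exceeds BERT's 512-token limit:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("model/danish_bert_model")
    # Without truncation the encoding keeps its full length (here 1593 tokens),
    # which is more than BERT's 512 position embeddings can handle.
    print(len(tokenizer(text)['input_ids']))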

I tried passing tokenizer=model to the pipeline, and also creating tokenizer = AutoTokenizer.from_pretrained(model_ckpt) before calling the get_predicted_folder method, but neither solves the issue.

This is how the tokenizer was used when the model was trained:

from transformers import AutoTokenizer

def tokenize_dataset(tokenizer, examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
dataset = dataset.map(lambda examples: tokenize_dataset(tokenizer, examples), batched=True)

Can someone please help me?

Thanks so much in advance!


Solution

  • Pass the tokenizer, the maximum length, and truncation to the pipeline as well, and it will work:

     pipe = pipeline("text-classification", model=model, tokenizer=model, max_length=512, truncation=True)
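
For completeness, here is a sketch of the question's function with the fix applied (same names as in the question; equivalently, truncation can be requested per call with pipe(text, truncation=True, max_length=512)):

    from transformers import pipeline

    def get_predicted_folder(text, model):
        # Loading the tokenizer from the same path and enabling truncation
        # cuts any input down to 512 tokens before it reaches the model.
        pipe = pipeline("text-classification", model=model, tokenizer=model,
                        max_length=512, truncation=True)
        if text:
            predicted_folder = pipe(text)
            return predicted_folder[0]['label'], predicted_folder[0]['score']
        return "Error: The provided text is empty.", None

Truncation silently discards everything past the first 512 tokens, so this fixes the crash but classifies the document by its beginning only; that is usually acceptable, but worth keeping in mind for very long texts.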