I want to test my model using the pipeline API from Transformers. My model is a pretrained BERT, which works great when the given text is shorter than 512 tokens. However, when I send a longer text to the pipeline, it breaks because the input is too long. I searched, but couldn't figure out how to solve this issue.
This is my code:
from transformers import pipeline

def get_predicted_folder(text, model):
    pipe = pipeline("text-classification", model=model)
    if text:
        predicted_folder = pipe(text)
        label = predicted_folder[0]['label']
        score = predicted_folder[0]['score']
        return label, score
    else:
        err = "Error: The provided text is empty."
        return err, None

my_saved_model = "model/danish_bert_model"  # saved locally
label, score = get_predicted_folder(text, my_saved_model)
It gives me this error: RuntimeError: The size of tensor a (1593) must match the size of tensor b (512) at non-singleton dimension 1
I tried passing tokenizer=model to the pipeline, and creating tokenizer = AutoTokenizer.from_pretrained(model_ckpt) before calling the get_predicted_folder method, but it doesn't solve the issue.
This is how the text was tokenized when the model was trained:
def tokenize_dataset(tokenizer, examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
dataset = dataset.map(lambda examples: tokenize_dataset(tokenizer, examples), batched=True)
Can someone please help me?
Thanks so much in advance!
Just pass the tokenizer, the maximum length, and truncation to the pipeline as well, and it will work:
pipe = pipeline("text-classification", model=model, tokenizer=model_path, max_length=512, truncation=True)
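For completeness, here is a minimal sketch of how the function from the question might look with this fix applied. It is an illustration, not the only way to do it: `build_pipe` is a hypothetical helper name, the pipeline is built once and passed in rather than recreated on every call, and the truncation settings are also passed at call time, which the text-classification pipeline forwards to the tokenizer. Note that truncation simply drops everything after the first 512 tokens, so the prediction is based only on the beginning of a long text.

```python
def build_pipe(model_path):
    """Create the classifier once (hypothetical helper, not part of Transformers)."""
    # Imported here so get_predicted_folder can be used with any callable classifier.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=model_path,
        tokenizer=model_path,
        truncation=True,   # cut inputs that are too long
        max_length=512,    # BERT's maximum sequence length
    )


def get_predicted_folder(text, pipe):
    """Return (label, score), or an error message and None for empty text."""
    if not text:
        return "Error: The provided text is empty.", None
    # Passing the settings per call makes the truncation explicit at the call site.
    result = pipe(text, truncation=True, max_length=512)
    return result[0]["label"], result[0]["score"]


# Usage, assuming the model from the question is saved locally:
# pipe = build_pipe("model/danish_bert_model")
# label, score = get_predicted_folder(text, pipe)
```

Building the pipeline once also avoids reloading the model from disk for every text you classify.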