I am running a sentence transformer model and trying to truncate my tokens, but it doesn't appear to be working. My code is:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
text_embedding = model(**text_tokens)["pooler_output"]
```
I keep getting the following warning:

```
Token indices sequence length is longer than the specified maximum sequence length
for this model (909 > 512). Running this sequence through the model will result in
indexing errors
```
Why is setting `truncation=True` not truncating my text to the desired length?
You need to pass the `max_length` parameter when calling the tokenizer, like below:

```python
text_tokens = tokenizer(text, padding=True, max_length=512, truncation=True, return_tensors="pt")
```
`truncation=True` without the `max_length` parameter truncates to the maximum input length accepted by the model, as reported by the tokenizer. For this model that is `1e30` (`1000000000000000019884624838656`), a sentinel value meaning "no limit", so in practice nothing gets truncated. You can check by printing `tokenizer.model_max_length`.
According to the Hugging Face documentation about truncation:

> `True` or `'only_first'`: truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`).
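A quick way to confirm this behavior yourself, as a minimal sketch using the same checkpoint (the long dummy text is just an illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-MiniLM-L6-v2"
)

# The sentinel "no limit" value, so truncation=True alone
# has nothing real to truncate to.
print(tokenizer.model_max_length)

long_text = "word " * 2000  # well over 512 tokens

# Without max_length: sequence length is not capped at 512.
untruncated = tokenizer(long_text, truncation=True, return_tensors="pt")
print(untruncated["input_ids"].shape)

# With max_length=512: the sequence dimension is capped at 512.
truncated = tokenizer(
    long_text, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
print(truncated["input_ids"].shape)  # torch.Size([1, 512])
```

Note that `padding=True` only pads to the longest sequence in the batch; it is `max_length` together with `truncation=True` that enforces the 512-token cap.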