Tags: python, huggingface-transformers, sentence-transformers

Token indices sequence length Issue


I am running a sentence-transformers model and trying to truncate my input tokens, but it doesn't appear to be working. My code is:

from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
text_embedding = model(**text_tokens)["pooler_output"]

I keep getting the following warning:

Token indices sequence length is longer than the specified maximum sequence length 
for this model (909 > 512). Running this sequence through the model will result in 
indexing errors

I am wondering why setting truncation=True does not truncate my text to the desired length.


Solution

  • You need to pass the max_length parameter when calling the tokenizer, like below:

    text_tokens = tokenizer(text, padding=True, max_length=512, truncation=True, return_tensors="pt")

    Reason:

    truncation=True without a max_length argument truncates to the maximum input length the model will accept.

    For this model that is 1e30, i.e. 1000000000000000019884624838656. You can check it by printing tokenizer.model_max_length.
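    That fallback value comes from the VERY_LARGE_INTEGER sentinel in transformers (defined as int(1e30)), which a tokenizer reports when no real maximum length is stored in its config. The odd-looking digits are just floating-point rounding, as this plain-Python sketch shows:

    ```python
    # 1e30 is a float and is not exactly representable as a binary double,
    # so converting it to int yields the "strange" number from the warning
    print(int(1e30))
    # 1000000000000000019884624838656
    ```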

    According to the Huggingface documentation about truncation,

    True or 'only_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None).
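    Putting it together, here is a minimal end-to-end sketch of the fix; the long_text input is invented for illustration, and it assumes the model weights can be downloaded from the Hugging Face Hub:

    ```python
    from transformers import AutoModel, AutoTokenizer

    model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # deliberately much longer than 512 tokens, to trigger truncation
    long_text = "sentence embeddings are useful " * 300

    # max_length=512 gives truncation=True a concrete limit to enforce,
    # so no "Token indices sequence length" warning is emitted
    text_tokens = tokenizer(long_text, padding=True, truncation=True,
                            max_length=512, return_tensors="pt")
    print(text_tokens["input_ids"].shape)  # torch.Size([1, 512])

    text_embedding = model(**text_tokens)["pooler_output"]
    ```

    Note that 512 here matches the position-embedding limit mentioned in the warning; for other checkpoints the right cap may differ, so check the model card before hard-coding it.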