I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from Hugging Face. My task requires me to use it on pretty large texts, so it's essential to know the maximum input length.
The following code is supposed to load the pretrained model and its tokenizer:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

encoding_model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
encoding_tokenizer = AutoTokenizer.from_pretrained(encoding_model_name)
encoding_model = SentenceTransformer(encoding_model_name)
So, when I print info about them:
encoding_tokenizer
encoding_model
I'm getting:
PreTrainedTokenizerFast(name_or_path='symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli', vocab_size=250002, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
As you can see, the model_max_len=512 parameter of the tokenizer doesn't match the max_seq_length=128 parameter of the model.
How can I figure out which one is correct? Or, if they refer to different things, how can I check the maximum input length for my model?
Since you are using a SentenceTransformer and loading it with the SentenceTransformer class, it will truncate your input at 128 tokens, as stated in the documentation (the relevant code is here):
property max_seq_length
Property to get the maximal input sequence length for the model. Longer inputs will be truncated.
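You can read both limits directly from the objects you loaded; a minimal sketch, assuming the variable names from your question:
print(encoding_tokenizer.model_max_length)  # 512: the tokenizer's configured limit
print(encoding_model.max_seq_length)        # 128: the limit SentenceTransformer.encode() actually enforces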
You can also check this yourself:
import torch

# `model` is the SentenceTransformer loaded in the question (encoding_model)
model = encoding_model
fifty = model.encode(["This " * 50], convert_to_tensor=True)
two_hundred = model.encode(["This " * 200], convert_to_tensor=True)
four_hundred = model.encode(["This " * 400], convert_to_tensor=True)

print(torch.allclose(fifty, two_hundred))
print(torch.allclose(two_hundred, four_hundred))
Output:
False
True
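These results make sense if you count the tokens produced for each input; a small sketch using the tokenizer from the question:
for n in (50, 200, 400):
    ids = encoding_tokenizer("This " * n)["input_ids"]
    # 50 repetitions stay under the 128-token limit, while 200 and 400 both exceed it,
    # so the latter two are truncated to the same 128 tokens before encoding
    print(n, len(ids))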
The underlying model (xlm-roberta-base) is able to handle sequences of up to 512 tokens, but I assume Symanto limited it to 128 because they also used this limit during training (i.e., the embeddings might not be good for sequences longer than 128 tokens).
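If you still want to use the full 512 tokens, max_seq_length is a writable property on the SentenceTransformer object, so you can raise it yourself; keep in mind the caveat above that the embeddings might not be good for such long sequences. A minimal sketch, reusing the variables from the check above:
model.max_seq_length = 512  # raise the limit to what xlm-roberta-base supports
four_hundred_full = model.encode(["This " * 400], convert_to_tensor=True)
print(torch.allclose(four_hundred, four_hundred_full))  # expected False: the input is no longer truncated to 128 tokens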