I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from Hugging Face. My task requires me to use it on pretty large texts, so it's essential to know the maximum input length.
The following code is supposed to load the pretrained model and its tokenizer:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

encoding_model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
encoding_tokenizer = AutoTokenizer.from_pretrained(encoding_model_name)
encoding_model = SentenceTransformer(encoding_model_name)
So, when I print info about them:
encoding_tokenizer
encoding_model
I'm getting:
PreTrainedTokenizerFast(name_or_path='symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli', vocab_size=250002, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
As you can see, the model_max_len=512 parameter of the tokenizer doesn't match the max_seq_length=128 parameter of the model.
How can I figure out which one is correct? Or, if they refer to different things, how can I check the maximum input length for my model?
Since you are using a SentenceTransformer and loading it with the SentenceTransformer class, it will truncate your input at 128 tokens, as stated in the documentation (the relevant code is here):
property max_seq_length
Property to get the maximal input sequence length for the model. Longer inputs will be truncated.
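You can read both limits directly from the objects you loaded; a minimal sketch, assuming the variable names from your question:
print(encoding_tokenizer.model_max_length)  # 512: the tokenizer's configured limit
print(encoding_model.max_seq_length)        # 128: the limit SentenceTransformer.encode() actually enforces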
You can also check this yourself:
import torch

# `model` is the SentenceTransformer loaded in the question (encoding_model)
model = encoding_model
fifty = model.encode(["This " * 50], convert_to_tensor=True)
two_hundred = model.encode(["This " * 200], convert_to_tensor=True)
four_hundred = model.encode(["This " * 400], convert_to_tensor=True)

print(torch.allclose(fifty, two_hundred))
print(torch.allclose(two_hundred, four_hundred))
Output:
False
True
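These results make sense if you count the tokens produced for each input; a small sketch using the tokenizer from the question:
for n in (50, 200, 400):
    ids = encoding_tokenizer("This " * n)["input_ids"]
    # 50 repetitions stay under the 128-token limit, while 200 and 400 both exceed it,
    # so the latter two are truncated to the same 128 tokens before encoding
    print(n, len(ids))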
The underlying model (xlm-roberta-base) is able to handle sequences of up to 512 tokens, but I assume Symanto limited it to 128 because they also used this limit during training (i.e., the embeddings might not be good for sequences longer than 128 tokens).
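If you still want to use the full 512 tokens, max_seq_length is a writable property on the SentenceTransformer object, so you can raise it yourself; keep in mind the caveat above that the embeddings might not be good for such long sequences. A minimal sketch, reusing the variables from the check above:
model.max_seq_length = 512  # raise the limit to what xlm-roberta-base supports
four_hundred_full = model.encode(["This " * 400], convert_to_tensor=True)
print(torch.allclose(four_hundred, four_hundred_full))  # expected False: the input is no longer truncated to 128 tokens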