I want to apply Roberta model for text similarity. Given a pair of sentences,the input should be in the format <s> A </s></s> B </s>
. I figure out two possible ways to generate the input ids namely
a)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands')
list2 = tokenizer.encode('Numbness of upper limb')
sequence = list1+[2]+list2[1:]
In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]
b)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = [0]+list1+[2,2]+list2+[2]
In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]
Here 0
represents <s>
token and 2 represents </s>
token. I'm not sure which is the correct way to encode the given two sentences for calculating sentence similarity using Roberta model.
The easiest way is probably to directly use the provided function by HuggingFace's Tokenizers themselves, namely the text_pair
argument in the encode
function, see here. This allows you to directly feed in two sentences, which will be giving you the desired output:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
text_pair='Numbness of upper limb',
add_special_tokens=True)
This is especially convenient if you are dealing with very long sequences, as the encode
function automatically reduces your lengths according to the truncaction_strategy
argument. You obviously don't have to worry about this, if it is only short sequences.
Alternatively, you can also make use of the more explicit build_inputs_with_special_tokens()
function of the RobertaTokenizer
, specifically, which could be added to your example like so:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)
Note that in that case, you have to generate the sequences list1
and list2
still without any special tokens, as you have already done correctly.