Tags: nlp, pytorch, huggingface-transformers, bert-language-model

Confusion in Pre-processing text for Roberta Model


I want to apply the RoBERTa model to text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I figured out two possible ways to generate the input ids, namely:

a)

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('roberta-base')

    # encode adds the special tokens by default: [0, ..., 2] for each sentence
    list1 = tokenizer.encode('Very severe pain in hands')
    list2 = tokenizer.encode('Numbness of upper limb')

    # keep list1 as is, add a second </s>, and drop the leading <s> of list2
    sequence = list1 + [2] + list2[1:]

In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]

b)

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('roberta-base')

    # encode without any special tokens, then add <s> (0) and </s> (2) manually
    list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
    list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)

    sequence = [0] + list1 + [2, 2] + list2 + [2]

In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]

Here 0 represents the <s> token and 2 represents the </s> token. Note that the two sequences differ in the ids produced for the first word of each sentence. I'm not sure which is the correct way to encode the given two sentences for calculating sentence similarity with the RoBERTa model.
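One way to inspect either variant is to decode the ids back into text; decode is a standard tokenizer method, so this is just a quick sanity check to see where the special tokens ended up:

    # Decode the assembled ids back to text to make the special tokens visible
    print(tokenizer.decode(sequence))
    # shows the two sentences wrapped as <s> ... </s></s> ... </s>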


Solution

  • The easiest way is probably to use the functionality HuggingFace's tokenizers already provide, namely the text_pair argument of the encode function (see the documentation). This allows you to feed in the two sentences directly and gives you the desired output:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    sequence = tokenizer.encode(text='Very severe pain in hands',
                                text_pair='Numbness of upper limb',
                                add_special_tokens=True)
    

    This is especially convenient if you are dealing with very long sequences, as the encode function automatically truncates them according to the truncation_strategy argument (called truncation in recent transformers releases). For short sequences like these, you obviously don't have to worry about this.
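    As a minimal sketch (the max_length value here is arbitrary, chosen just for illustration; on older transformers versions pass truncation_strategy instead of truncation):

    sequence = tokenizer.encode(text='Very severe pain in hands',
                                text_pair='Numbness of upper limb',
                                add_special_tokens=True,
                                max_length=16,               # arbitrary limit for the example
                                truncation='longest_first')  # truncate the longer sentence first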

    Alternatively, you can also make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer, which can be applied to your example like so:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    
    list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
    list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
    
    sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)
    

    Note that in this case you still have to generate the sequences list1 and list2 without any special tokens, as you have already done correctly.
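    Both approaches should produce the same ids on current transformers versions, and the result can then be passed through the model for the similarity computation. A minimal sketch (assuming a recent transformers release where the model returns a ModelOutput; the pooling choice is just one common option):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    model = AutoModel.from_pretrained('roberta-base')

    # Encode the sentence pair in one call
    sequence = tokenizer.encode(text='Very severe pain in hands',
                                text_pair='Numbness of upper limb',
                                add_special_tokens=True)

    # Run the pair through RoBERTa
    with torch.no_grad():
        output = model(torch.tensor([sequence]))

    # One common option: take the <s> (first) token's hidden state as the
    # pair representation and feed it to a similarity/classification head
    pair_embedding = output.last_hidden_state[:, 0]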