I am using a transformers tokenizer and created a mask with the get_special_tokens_mask API.
My Code
According to the RoBERTa docs, this API returns "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". But it seems to me that it actually returns 0 for a sequence token and 1 for a special token.
Is that right?
You are indeed correct. I tested this with both transformers 2.7 and the (at the time of writing) current release, 2.9, and in both cases I get the inverted result: 0 for regular tokens and 1 for special tokens.
For reference, this is how I tested it:
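import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
sentence = "This is a special sentence."
# encode() already adds the special tokens <s> (id 0) and </s> (id 2)
encoded_sentence = tokenizer.encode(sentence)
# [0, 152, 16, 10, 780, 3645, 4, 2]
# The ids already contain special tokens, so tell the method; by default it
# assumes they are missing and returns a mask that is two entries too long.
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence, already_has_special_tokens=True)
# [1, 0, 0, 0, 0, 0, 0, 1]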
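If you need the convention the docs describe (0 for special, 1 for sequence), or just want to drop the special tokens, you can work with the mask directly. Here is a minimal plain-Python sketch of the observed behavior. It is not the library's implementation: the helper names are made up for illustration, and the special ids 0 and 2 are an assumption taken from the `<s>` and `</s>` ids at both ends of the encoded sentence above.

```python
# Assumption: 0 and 2 are the ids of RoBERTa's <s> and </s> tokens,
# as seen at both ends of the encoded sentence above.
ASSUMED_SPECIAL_IDS = frozenset({0, 2})


def special_tokens_mask(token_ids, special_ids=ASSUMED_SPECIAL_IDS):
    """Hypothetical helper: 1 for a special token, 0 for a sequence token."""
    return [1 if token_id in special_ids else 0 for token_id in token_ids]


def strip_special_tokens(token_ids, mask):
    """Hypothetical helper: keep only the ids whose mask entry is 0."""
    return [token_id for token_id, m in zip(token_ids, mask) if m == 0]


encoded_sentence = [0, 152, 16, 10, 780, 3645, 4, 2]

# Observed convention: 1 marks a special token.
mask = special_tokens_mask(encoded_sentence)

# Flip the mask to get the convention the docs describe.
inverted = [1 - m for m in mask]

# Use the mask to drop the special tokens.
content_ids = strip_special_tokens(encoded_sentence, mask)

print(mask)
print(inverted)
print(content_ids)
```

The same list comprehensions work on the mask returned by the tokenizer itself, so you do not have to wait for the docs to be fixed.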
I would suggest you report this issue in their repository, or ideally provide a pull request yourself to fix the issue ;-)