
About get_special_tokens_mask in huggingface-transformers


I am using a transformers tokenizer, and I created a mask with the get_special_tokens_mask API.

In the RoBERTa documentation, the return value of this API is described as "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". But it seems to me that this API returns the opposite: 0 for a sequence token and 1 for a special token.
Is this the expected behavior?


Solution

  • You are indeed correct. I tested this for both transformers 2.7 and the (at the time of writing) current release, 2.9, and in both cases I get the inverted result (0 for regular tokens, and 1 for the special tokens).

    For reference, this is how I tested it:

    import transformers
    
    tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
    sentence = "This is a special sentence."
    
    encoded_sentence = tokenizer.encode(sentence)
    # [0, 152, 16, 10, 780, 3645, 4, 2]
    special_masks = tokenizer.get_special_tokens_mask(encoded_sentence)
    # [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
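
    Note that the mask above has two more entries than the encoded sentence: without already_has_special_tokens=True, the tokenizer assumes the ids do not yet contain special tokens and returns the mask for the sequence it would build after adding them. A minimal sketch of the aligned call follows; the expected output is my own assumption, based on roberta-base using ids 0 and 2 for <s> and </s>:

    # Passing already_has_special_tokens=True tells the tokenizer that the
    # ids already include <s> and </s>, so the mask lines up with the input.
    # The values are still inverted with respect to the documentation.
    aligned_mask = tokenizer.get_special_tokens_mask(
        encoded_sentence, already_has_special_tokens=True
    )
    # [1, 0, 0, 0, 0, 0, 0, 1]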
    

    I would suggest that you report this issue in their repository, or ideally open a pull request yourself to fix it ;-)