Tags: pytorch, tokenize, huggingface-transformers, bert-language-model, huggingface-tokenizers

Equivalent to tokenizer() in Transformers 2.5.0?


I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0.

# Converting pretrained BERT classification model to regression model
# i.e. extracting base model and swapping out heads

from transformers import BertTokenizer, BertForSequenceClassification
import torch
import numpy as np

old_model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1) 
model.bert = old_model.bert

# Ensure that model parameters are equivalent except for the classifier head layer
new_state, old_state = model.state_dict(), old_model.state_dict()
for param_name in new_state:
    if 'classifier' not in param_name:
        sub_param, full_param = new_state[param_name], old_state[param_name]  # type: torch.Tensor, torch.Tensor
        assert (sub_param.cpu().numpy() == full_param.cpu().numpy()).all(), param_name


tokenizer = BertTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

output_value = np.array(logits)[0][0]
print(output_value)

The tokenizer is not callable in Transformers 2.5.0, resulting in the following:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-d83f0d613f4b> in <module>
     19 
     20 
---> 21 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
     22 
     23 with torch.no_grad():

TypeError: 'BertTokenizer' object is not callable
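
From what I can tell, the callable tokenizer(...) interface was only introduced in the 3.x releases, so a 2.5.0 tokenizer has no __call__ method at all. A quick check (assuming a 2.5.0 install):

import transformers

print(transformers.__version__)  # 2.5.0
print(callable(tokenizer))       # False -- tokenizers only became callable in 3.x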

However, attempting to replace tokenizer() with tokenizer.tokenize() results in the following:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-1d431131eb87> in <module>
     21 
     22 with torch.no_grad():
---> 23     logits = model(**inputs).logits
     24 
     25 output_value = np.array(logits)[0][0]

TypeError: BertForSequenceClassification object argument after ** must be a mapping, not list
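
For reference, tokenizer.tokenize() only splits the text into a plain list of subword strings, so there is no mapping for ** to unpack:

tokens = tokenizer.tokenize("Hello, my dog is cute")
print(tokens)  # ['hello', ',', 'my', 'dog', 'is', 'cute']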

Any help would be greatly appreciated.


Solution

Using tokenizer.encode_plus() as suggested by @cronoik:

tokenized = tokenizer.encode_plus("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**tokenized)  # in 2.x the model returns a plain tuple, not an output object with .logits

output_value = np.array(logits)[0]  # index into the one-element tuple to get the logits array
print(output_value)
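
If you also need the other conveniences of the 4.x call, encode_plus accepts most of the same arguments under their 2.x-era names. A sketch for a sentence pair with fixed-length padding; the second sentence is made-up example text, and pad_to_max_length (later replaced by padding=... in 4.x) may not be available in every 2.x release:

pair = tokenizer.encode_plus(
    "Hello, my dog is cute",     # first sequence
    "He is very playful",        # optional second sequence (made-up example text)
    max_length=32,
    pad_to_max_length=True,      # 2.x-era name for fixed-length padding
    return_tensors="pt",
)
print(pair["input_ids"].shape)   # torch.Size([1, 32])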

Solution

  • Sadly, the documentation for the old versions is broken, but you can use encode_plus as shown in the following (the oldest available documentation of encode_plus is from 2.10.0):

    import torch
    from transformers import BertTokenizer
    
    
    t = BertTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
    tokenized = t.encode_plus("Hello, my dog is cute", return_tensors='pt')
    print(tokenized)
    

    Output:

    {'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 
    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
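
    This dictionary can be fed straight into the model from the question. One caveat: 2.x models return plain tuples rather than output objects, so the logits are the first tuple element (a minimal sketch, reusing model from the question):

    with torch.no_grad():
        outputs = model(**tokenized)  # 2.x models return a tuple, not an output object

    logits = outputs[0]               # first tuple element holds the logits
    print(logits[0, 0].item())        # single regression score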