Tags: nlp, text-classification, huggingface-transformers, bert-language-model

Customize the encode module in huggingface bert model


I am working on a text classification project using the Huggingface transformers module. The encode_plus function provides users with a convenient way of generating the input ids, attention masks, token type ids, and so on. For instance:

from transformers import BertTokenizer

pretrained_model_name = 'bert-base-cased'
bert_base_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)

cleaned_tweet = 'Bamboo poles, installation by an unknown building constructor'
hashtag_string = '#discoverhongkong #hongkonginsta'

encoding = bert_base_tokenizer.encode_plus(
        cleaned_tweet, hashtag_string,
        max_length=70,
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        return_token_type_ids=True,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt',  # Return PyTorch tensors
    )

print('*'*20)
print(encoding['input_ids'])
print(encoding['attention_mask'])
print(encoding['token_type_ids'])
print('*'*20)

However, my current project requires me to generate customized ids for a given text. For instance, given a list of words such as [HK, US, UK], I want to generate ids for the words in this list and set the ids of all other words to zero. These ids are then used to look up embeddings in a separate, customized embedding matrix, not in the pretrained BERT embeddings.
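
To make the requirement concrete, this is roughly the behavior I am after (a hypothetical sketch; the word list, the ids, and the embedding size are just illustrative):

import torch
import torch.nn as nn

# Hypothetical custom vocabulary: listed words get ids 1, 2, 3, ...;
# every other word should map to 0.
custom_vocab = {'HK': 1, 'US': 2, 'UK': 3}

def custom_encode(tokens, vocab, max_length=70):
    # Map each token to its custom id (0 if it is not in the list) and pad.
    ids = [vocab.get(token, 0) for token in tokens]
    ids = ids[:max_length] + [0] * (max_length - len(ids))
    return torch.tensor(ids).unsqueeze(0)  # shape: (1, max_length)

custom_ids = custom_encode('HK and US markets reopened'.split(), custom_vocab)

# These ids index a separate, trainable embedding matrix, not BERT's.
custom_embedding = nn.Embedding(num_embeddings=len(custom_vocab) + 1,
                                embedding_dim=16,
                                padding_idx=0)
custom_vectors = custom_embedding(custom_ids)  # shape: (1, max_length, 16)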

How can I achieve this kind of customized encoder? Any suggestions and solutions are welcome! Thanks~


Solution

  • I think you can use the [unusedX] tokens in the BERT vocab and map your custom tokens onto them. That way you can easily refer to each of them with a valid token ID.
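
    A minimal sketch of that idea (assuming the bert-base-cased vocabulary contains reserved tokens of the form [unused1], [unused2], ...; the word list is just illustrative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    # Reserve one [unusedX] slot per custom word; these tokens already exist
    # in the BERT vocabulary, so each word gets a valid, stable token id.
    custom_words = ['HK', 'US', 'UK']
    word_to_id = {}
    for i, word in enumerate(custom_words, start=1):
        token_id = tokenizer.convert_tokens_to_ids('[unused{}]'.format(i))
        # If the slot were missing, convert_tokens_to_ids would fall back to the [UNK] id.
        assert token_id != tokenizer.unk_token_id
        word_to_id[word] = token_id

    print(word_to_id)

    You can then use word_to_id when building your input ids, and keep a separate embedding matrix indexed by those ids if you do not want to touch the pretrained BERT embeddings.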