Tags: nlp, text-classification, bert-language-model, huggingface-transformers, pytorch-lightning

What is the simplest way to continue training a pre-trained BERT model, on a specific domain?


I want to use a pre-trained BERT model for a text classification task (I'm using the Huggingface library). However, the pre-trained model was trained on domains different from mine, and I have a large unannotated dataset that could be used to adapt it. If I use only my tagged examples and fine-tune it "on the go" while training on the specific task (BertForSequenceClassification), the dataset is too small to adapt the language model to the specific domain. What is the best way to do this? Thanks!


Solution

  • Let's clarify a couple of points first to reduce some ambiguity.

    1. BERT uses two pretraining objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
    2. You mentioned having a large unannotated dataset, which you plan to use to fine-tune your BERT model. This is not how fine-tuning works. To fine-tune your pretrained model, you need an annotated dataset, i.e. document and class-label pairs for the sequence classification downstream task (a fine-tuning sketch follows at the end of this answer).

    So what can you do? First, extend your general-domain tokenizer with domain-specific vocabulary drawn from your unannotated dataset. Then, using this extended tokenizer, continue pretraining on the MLM and/or NSP objectives so the word embeddings adapt to your domain. Finally, fine-tune the adapted model on your annotated dataset. Rough sketches of these steps follow below.
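
    Here is a minimal sketch of the first two steps (extending the tokenizer and continuing MLM pretraining), assuming the Hugging Face transformers and datasets libraries. The file name domain_corpus.txt, the example domain tokens, and all hyperparameters are placeholders, not values from the question:

        from transformers import (
            BertTokenizerFast,
            BertForMaskedLM,
            DataCollatorForLanguageModeling,
            Trainer,
            TrainingArguments,
        )
        from datasets import load_dataset

        # Load the general-domain checkpoint and tokenizer.
        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        model = BertForMaskedLM.from_pretrained("bert-base-uncased")

        # Step 1: extend the vocabulary with domain-specific terms (placeholder list).
        new_domain_tokens = ["troponin", "myocarditis"]
        tokenizer.add_tokens(new_domain_tokens)
        model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

        # Step 2: continue pretraining on the MLM objective over the unannotated corpus
        # ("domain_corpus.txt" is a hypothetical file with one document per line).
        raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
        tokenized = raw.map(
            lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
            batched=True,
            remove_columns=["text"],
        )
        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir="bert-domain-adapted",
                num_train_epochs=1,
                per_device_train_batch_size=16,
            ),
            train_dataset=tokenized["train"],
            data_collator=collator,
        )
        trainer.train()

        # Save the domain-adapted checkpoint for the classification fine-tuning step.
        trainer.save_model("bert-domain-adapted")
        tokenizer.save_pretrained("bert-domain-adapted")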
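
    And a minimal sketch of the final fine-tuning step, assuming the checkpoint saved above and a hypothetical labeled.csv file with "text" and "label" columns:

        from transformers import (
            BertTokenizerFast,
            BertForSequenceClassification,
            Trainer,
            TrainingArguments,
        )
        from datasets import load_dataset

        # Load the domain-adapted checkpoint saved in the previous step.
        tokenizer = BertTokenizerFast.from_pretrained("bert-domain-adapted")
        model = BertForSequenceClassification.from_pretrained(
            "bert-domain-adapted", num_labels=2  # set num_labels to your class count
        )

        # "labeled.csv" is a hypothetical annotated dataset (text + label columns).
        dataset = load_dataset("csv", data_files={"train": "labeled.csv"})
        encoded = dataset.map(
            lambda batch: tokenizer(
                batch["text"], truncation=True, padding="max_length", max_length=128
            ),
            batched=True,
        )

        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir="bert-domain-classifier",
                num_train_epochs=3,
                per_device_train_batch_size=16,
            ),
            train_dataset=encoded["train"],
        )
        trainer.train()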