I want to use a pre-trained BERT model for a text classification task (I'm using the Hugging Face library). However, the pre-trained model was trained on domains different from mine, and I have a large unannotated dataset that could be used to adapt it. If I only use my tagged examples and fine-tune it "on the go" while training on the specific task (BertForSequenceClassification), the dataset is too small to adapt the language model to the specific domain. What is the best way to do this? Thanks!
Let's clarify a couple of points first to reduce ambiguity: adapting the language model to your domain (continued pretraining on the unannotated text) and fine-tuning on your labeled classification examples are two separate steps, and with a large unannotated corpus you can do both.
So what can you do?

1. Extend your general-domain tokenizer with domain-specific vocabulary mined from your unannotated dataset (and resize the model's token embeddings to match).
2. Using this extended tokenizer, continue pretraining on the MLM and/or NSP objectives over the unannotated corpus so the word embeddings and other representations adapt to your domain.
3. Finally, fine-tune the resulting checkpoint on your annotated dataset with BertForSequenceClassification.

Rough sketches of these steps follow below.
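Here is a minimal sketch of steps 1 and 2 (MLM only, skipping NSP) with the `transformers` and `datasets` libraries. The file name `domain_corpus.txt`, the example tokens, the output directory, and all hyperparameters are placeholders you would replace with your own:

```python
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1) Load the general-domain tokenizer and extend it with domain-specific terms.
#    `new_domain_tokens` is a hypothetical list you would mine from your corpus.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
new_domain_tokens = ["myocarditis", "troponin"]  # placeholder examples
tokenizer.add_tokens(new_domain_tokens)

# 2) Continue pretraining on the MLM objective over the unannotated corpus.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

# Assumes one sentence/paragraph per line in domain_corpus.txt.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking and padding handled by the collator.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="bert-domain-adapted",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

trainer.save_model("bert-domain-adapted")
tokenizer.save_pretrained("bert-domain-adapted")
```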
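And a sketch of step 3, fine-tuning from the domain-adapted checkpoint saved above. It assumes annotated CSV files with `text` and `label` columns and a binary task (`num_labels=2`); adjust these to your data:

```python
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Load the domain-adapted checkpoint and its extended tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("bert-domain-adapted")
model = BertForSequenceClassification.from_pretrained(
    "bert-domain-adapted", num_labels=2  # set to your number of classes
)

# Hypothetical annotated CSVs with "text" and "label" columns.
raw = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-domain-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
trainer.evaluate()
```

Note that because the classifier is loaded from the domain-adapted directory, it automatically picks up the resized embedding matrix and the extended vocabulary from the previous step.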