I would like to use a language model such as Bert to get a feature vector for a certain text describing a medical condition.
Since the text contains many words that are unknown to most pre-trained models and tokenizers, which steps are required to achieve this?
Using a pre-trained model seems beneficial to me since the dataset describing the medical conditions is quite small.
Yes, this question is too general to be on Stack Overflow, but I'll try to give some helpful pointers.
First, look for existing medical pre-trained models, such as BioBERT, SciBERT, or ClinicalBERT, which are already trained on biomedical or clinical text and are available on the Hugging Face Hub.
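Once you have a checkpoint, extracting a feature vector is the same regardless of domain. Here is a minimal sketch using the Hugging Face `transformers` library; the model name below is a placeholder (swap in whichever medical checkpoint you find), and mean pooling over the last hidden state is one common choice, not the only one:

```python
# Sketch: turn a text into a fixed-size feature vector with a
# pre-trained transformer. "bert-base-uncased" is a placeholder;
# substitute a medical checkpoint if you find one.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder, not a medical model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "Patient presents with acute dyspnea and pleuritic chest pain."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    # last_hidden_state has shape (batch, seq_len, hidden_size)
    hidden = model(**inputs).last_hidden_state

# Mean-pool over real tokens (the attention mask zeroes out padding)
# to collapse the sequence into a single vector per text.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print(vector.shape)
```

Alternatives to mean pooling include taking the `[CLS]` token's hidden state, or using a sentence-embedding model such as those from `sentence-transformers`, which are trained specifically to produce useful pooled vectors.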
Otherwise, fine-tune BERT/RoBERTa on your domain data or on whatever downstream task you're working on (classification/question answering), so that the model's representations adapt to the medical terminology in your corpus.
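One reassuring detail about the "unknown words" concern: BERT-style tokenizers almost never emit `[UNK]` for an unseen medical term. WordPiece falls back to subword pieces, so the model still sees something, and fine-tuning teaches it what those piece combinations mean. The toy function and vocabulary below are illustrative, sketching WordPiece's greedy longest-match-first splitting:

```python
def wordpiece(word, vocab):
    """Toy greedy longest-match-first split, as in BERT's WordPiece.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation of a word
            if sub in vocab:
                match = sub
                break
            end -= 1  # shrink the candidate and retry
        if match is None:
            return ["[UNK]"]  # no piece fits at all (rare in practice)
        pieces.append(match)
        start = end
    return pieces

# Tiny illustrative vocabulary; real vocabularies have ~30k entries.
vocab = {"pneumo", "##nia", "##thorax", "head", "##ache"}
print(wordpiece("pneumonia", vocab))     # ['pneumo', '##nia']
print(wordpiece("pneumothorax", vocab))  # ['pneumo', '##thorax']
print(wordpiece("headache", vocab))      # ['head', '##ache']
```

So a term the tokenizer has never seen, like "pneumothorax" here, is still represented as known subword pieces rather than a single unknown token; fine-tuning then shapes how the model composes those pieces.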