Tags: python, nlp, huggingface-transformers

Fine-Tuning BERT on a Medical Dataset


I would like to use a language model such as BERT to get a feature vector for a text describing a medical condition.

Since the text contains many words that are unknown to most pre-trained models and tokenizers, what steps are required to achieve this?

Using a pre-trained model seems beneficial, since my dataset of medical condition descriptions is quite small.


Solution

  • Yes, this question is too general to be on Stack Overflow, but I'll try to give some helpful pointers.

    1. Try to look for an existing medical pre-trained model, such as BioBERT or ClinicalBERT (see the loading sketch after this list).

    2. Otherwise, fine-tune BERT/RoBERTa on your domain or on whatever downstream task (classification/question answering) you're working on, so that the model captures the unknown medical terms in your corpus (see the second sketch below).
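
Here is a minimal sketch of the first option: extracting a feature vector with an existing biomedical checkpoint from the Hugging Face Hub (BioBERT here; other biomedical BERT variants work the same way). The example sentence is made up:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "dmis-lab/biobert-v1.1"  # one of several biomedical checkpoints on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "Patient presents with acute myocardial infarction and dyspnea."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] token's hidden state as a fixed-size feature vector,
# or mean-pool over all token embeddings.
cls_vector = outputs.last_hidden_state[:, 0, :]      # shape: (1, 768)
mean_vector = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```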
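
And a sketch of the second option: domain-adaptive pre-training with the standard masked-language-modelling objective, so the encoder adapts to the medical vocabulary before you use it for feature extraction. This assumes the transformers and datasets libraries; medical_corpus.txt (one document per line) and the hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Optionally register frequent medical terms so they are not split into
# subwords, then resize the embedding matrix (the token list is a placeholder):
# tokenizer.add_tokens(["dyspnea", "myocardial"])
# model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-medical-mlm",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

Afterwards, the adapted encoder can be reloaded with AutoModel (as in the first sketch) to produce feature vectors for your small labelled dataset.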