machine-learning · nlp · artificial-intelligence

When should I train my own models and when should I use pretrained models?


Is it recommended to train my own models for things like sentiment analysis despite having only a very small dataset (5000 reviews), or is it better to use pretrained models, which were trained on far larger datasets but aren't "specialized" for my data?

Also, how could I train my model on my data and then later use it on that data too? I was thinking of an iterative approach where the training data would be a randomly selected subset of my total data for each learning epoch.


Solution

  • I would go like this:

    • Try the pre-trained model first and see how it goes (a quick sketch follows this list)
    • If the results are unsatisfactory, you can fine-tune it (see this tutorial). Basically, you use your own examples to adjust the weights of the pre-trained model. This should improve the results, but it depends on what your data looks like and how many examples you can provide. The more you have, the better it should work (I would try to use 10-20k at least)
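
    For step one, here is a minimal sketch using the Hugging Face transformers pipeline API; the checkpoint it downloads is the library's default English sentiment model, not something specific to your data:

        # Off-the-shelf trial: score a couple of your own reviews with the
        # default sentiment pipeline and eyeball the predictions.
        from transformers import pipeline

        classifier = pipeline("sentiment-analysis")
        print(classifier(["The product arrived quickly and works great!",
                          "Broke after two days, very disappointed."]))
        # Each result looks like {'label': 'POSITIVE', 'score': 0.99}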

    Also, how could I train my model on my data and then later use it on it too?

    Be careful to distinguish between pre-training and fine-tuning.

    For pre-training you need a huge amount of text (billions of characters) and it is very resource-demanding; you typically don't want to do it unless you have a very good reason (for example, no model exists for your target language).

    Fine-tuning requires far fewer examples (a few tens of thousands), typically takes less than a day on a single GPU, and lets you exploit pre-trained models created by someone else.

    From what you write, I would go with fine-tuning.
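
    As a hedged sketch of what that could look like with the transformers Trainer API; "my_reviews.csv" (with "text" and "label" columns) and the distilbert-base-uncased checkpoint are illustrative assumptions, not part of the tutorial:

        # Fine-tune a small pre-trained checkpoint on your labeled reviews.
        from datasets import load_dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        checkpoint = "distilbert-base-uncased"  # assumed starting model
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=2)

        # "my_reviews.csv" is a hypothetical file with "text"/"label" columns.
        dataset = load_dataset("csv", data_files="my_reviews.csv")["train"]
        dataset = dataset.map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length"),
            batched=True)
        split = dataset.train_test_split(test_size=0.1)  # hold out 10% to evaluate

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="out", num_train_epochs=3),
            train_dataset=split["train"],
            eval_dataset=split["test"],
        )
        trainer.train()

    Note that the Trainer shuffles the training set at every epoch by default, which already covers the random per-epoch sampling you describe.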

    Of course you can save the model for later, as you can see in the tutorial I linked above:

    model.save_pretrained("my_imdb_model")
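
    To use it on the rest of your data later, load it back with from_pretrained. A minimal sketch, assuming you also saved the tokenizer alongside the model with tokenizer.save_pretrained("my_imdb_model"):

        # Reload the fine-tuned model and tokenizer from the directory saved
        # above, then wrap them in a pipeline to score the remaining reviews.
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, pipeline)

        model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model")
        tokenizer = AutoTokenizer.from_pretrained("my_imdb_model")
        classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
        print(classifier("Not what I expected, but decent overall."))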