machine-learning deep-learning classification vgg-net transfer-learning

Why I need pre-trained weight in transfer learning

I'm learning transfer learning with some pre-trained models (vgg16, vgg19,…), and I wonder why I need to load pre-trained weight to train my own dataset.

I can understand if the classes in my dataset are included in the dataset that the pre-trained model is trained with. For example, VGG model was trained with 1000 classes in Imagenet dataset, and my model is to classify cat-dog, which are also in the Imagenet dataset. But here the classes in my dataset are not in this dataset. So how the pre-trained weight can help?

Solution

You don't have to use a pretrained network in order to train a model for your task. However, in practice using a pretrained network and retraining it to your task/dataset is usually faster and often you end up with better models yielding higher accuracy. This is especially the case if you do not have a lot of training data.

Why faster?

It turns out that (relatively) independent of the dataset and target classes, the first couple of layers converge to similar results. This is due to the fact that low level layers usually act as edge, corner and other simple structure detectors. Check out this example that visualizes the structures that filters of different layers "react" to. Having already trained the lower layers, adapting the higher level layers to your use case is much faster.

Why more accurate?

This question is harder to answer. IMHO it is due to the fact that pretrained models that you use as basis for transfer learning were trained on massive datasets. This means that the knowledge acquired flows into your retrained network and will help you to find a better local minimum of your loss function.

If you are in the compfortable situation that you have a lot of training data you should probably train a model from scratch as the pretained model might "point you in the wrong direction". In this master thesis you can find a bunch of tasks (small datasets, medium datasets, small semantical gap, large semantical gap) where 3 methods are compared : fine tuning, features extraction + SVM, from scratch. Fine tuning a model pretrained on Imagenet is almost always a better choice.