machine-learning, keras, deep-learning, yolo

Understanding How YOLO is trained


I'm trying to understand how YOLO (v2) is trained. To do so, I'm using this Keras implementation, https://github.com/experiencor/keras-yolo2, to train YOLO from scratch on the VOC dataset. (I'm open to other implementations, but I have never worked with PyTorch, so a Keras implementation is my first choice.)

As I understand it, YOLO is first trained for classification on ImageNet, and these classification weights should then be used somewhere when training YOLO for regression (to detect bounding boxes). In most of the code I found online for training YOLO from scratch (for regression), I don't see where these classification weights are loaded. When does this happen? When are the classification weights used in training YOLO for regression?

Is my understanding as described above correct?


Solution

  • You have two options:

    • Use pre-trained weights for the whole detector (backend + frontend, i.e. the classification network + the detector).
    • Use pre-trained weights for the backend only.

    All of this is explained at https://github.com/experiencor/keras-yolo2#2-edit-the-configuration-file, in the repository you linked.

    In the code, loading of pre-trained weights for the whole model is done here. It is optional.

    Pre-trained weights for the backend are mandatory (according to the tutorial); in the code this is done here (example for full YOLO). Note that you must download the backend weights before creating the model, as stated in the tutorial or at the beginning of the file.

    Edit 1

    If your number of classes changes, the number of filters in the detector part (the front-end) changes too, because the size of the classification vector changes. However, the back-end (the feature extractor, i.e. the backbone) stays the same even if the number of classes changes.

    You can use any pre-trained weights that match the size of the backbone, but you cannot do so for the whole network if the number of classes differs. For instance, you cannot use the raccoon detector's weights for a dog and cat detector.

    You cannot use the original YOLOv2 weights to initialise this network directly, because the Darknet and Keras formats differ; you first have to convert them to the Keras format.

    It's fine to use only backbone pre-trained weights if you have enough training data.

    Note that there is an additional option called transfer learning. If you have a pre-trained network (backbone and front-end), you can extract the backbone weights and use them to initialise your network's backbone.
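
    As a rough illustration of which weights can be reused, here is a minimal, framework-free sketch that compares layers by name and kernel shape. The layer names and shapes are hypothetical and are not taken from the keras-yolo2 code; only the idea (backbone shapes match, the final detection layer does not when the class count differs) comes from the explanation above.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def transferable_layers(pretrained: Dict[str, Shape],
                        target: Dict[str, Shape]) -> Dict[str, bool]:
    """For each target layer, report whether the pretrained model has a
    layer with the same name and the same weight shape."""
    return {name: pretrained.get(name) == shape
            for name, shape in target.items()}


# Hypothetical layer names; kernel shapes are (height, width, in, out).
raccoon_model = {
    "conv_1": (3, 3, 3, 32),
    "conv_22": (3, 3, 1280, 1024),
    "detection_conv": (1, 1, 1024, 5 * (4 + 1 + 1)),  # 1 class
}
cat_dog_model = {
    "conv_1": (3, 3, 3, 32),
    "conv_22": (3, 3, 1280, 1024),
    "detection_conv": (1, 1, 1024, 5 * (4 + 1 + 2)),  # 2 classes
}

# Backbone layers match; the detection layer does not.
report = transferable_layers(raccoon_model, cat_dog_model)
```

    In Keras 2.x, `model.load_weights(path, by_name=True, skip_mismatch=True)` on an HDF5 weight file does a comparable name-and-shape matching, but check the documentation of the version you use before relying on it.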

    Edit 2

    No, the front-end and backend are not, strictly speaking, two separate networks: they are two chained networks. In fact, in most deep learning frameworks such as PyTorch, Keras or TensorFlow, any layer (fully connected, convolutional, max-pooling, ...) can be considered a network.

    A "network" is just an object that represents an arbitrarily complex mathematical function mapping inputs to outputs, to which automatic differentiation can be applied (you must define forward and backward propagation).

    In a single-shot object detector such as YOLO, it is more relevant to think of the whole network as a chain of two networks: the backbone and the detector. This representation allows more generic constructs and a wider variety of tuning (e.g. using a higher-performing backbone or a lightweight one).
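
    The "chain of two networks" view can be sketched with plain functions. This is a toy illustration only: `backbone` and `detector` below are made-up stand-ins operating on lists of numbers, not real layers, and `chain` is a hypothetical helper.

```python
from typing import Callable, List

Stage = Callable[[List[float]], List[float]]


def chain(*stages: Stage) -> Stage:
    """Compose stages left to right into a single callable 'network'."""
    def network(x: List[float]) -> List[float]:
        for stage in stages:
            x = stage(x)
        return x
    return network


def backbone(x: List[float]) -> List[float]:
    """Toy 'feature extractor': doubles every input value."""
    return [2.0 * v for v in x]


def detector(features: List[float]) -> List[float]:
    """Toy 'detection head': reduces the features to one number."""
    return [sum(features)]


# The whole model is the backbone followed by the detector.
yolo_like = chain(backbone, detector)
out = yolo_like([1.0, 2.0, 3.0])

# Swapping in a different (here: identity) backbone leaves the head as is.
light_model = chain(lambda x: x, detector)
```

    The point of the design is that the head only depends on what the backbone outputs, so either half can be replaced or pre-trained separately.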

    Yes, you are right: bounding box regression and label classification happen at the very end of the whole network, hence in the front-end.

    The front-end can have an arbitrary number of layers; the only constraint is on its last layer, which must have a specific channel size (i.e. a given number of filters) that is always determined by the number of classes you want to classify.

    Usually the number of channels in the last output layer should be numberOfClasses + 4, where numberOfClasses includes the background class and the number 4 represents the four coordinates of the bounding box. This example is heavily simplified; I advise you to read the YOLO papers for a better understanding of the network structure.

    It appears that there is only one trainable layer (a 2D Conv here) in the detector network. Note that the size of its output is constrained by the number of classes: self.nb_box * (4 + 1 + self.nb_class).
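
    To make the formula concrete, here is a small sketch of the arithmetic. The function name is hypothetical; the expression is the one quoted above, read as 4 box coordinates + 1 objectness score + one score per class, for each anchor box. The choice of 5 anchor boxes and 20 VOC classes follows the YOLOv2 paper.

```python
def detector_output_filters(nb_box: int, nb_class: int) -> int:
    """Number of filters of the final conv layer:
    per anchor box, 4 coordinates + 1 objectness score + class scores."""
    return nb_box * (4 + 1 + nb_class)


# Pascal VOC: 20 classes, 5 anchor boxes.
voc_filters = detector_output_filters(nb_box=5, nb_class=20)      # 5 * 25 = 125

# A single-class detector (e.g. raccoons only) with the same 5 anchors.
raccoon_filters = detector_output_filters(nb_box=5, nb_class=1)   # 5 * 6 = 30
```

    Because the two counts differ, the weights of this last layer cannot be shared between detectors with different numbers of classes, while everything before it can.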

    This layer's parameters are then initialised from a random distribution.

    Concerning your last question, I think you are correct about the procedure for transfer learning; that should work.