machine-learning neural-network artificial-intelligence object-detection yolo

Unable to understand YOLOv4 architecture

I was going through the YOLOv4 paper, where the authors describe a Backbone (CSPDarknet53), a Neck (SPP followed by PANet), and then a Head (YOLOv3). So is the architecture something like this:

CSP Darknet-53-->SPP-->PANet-->YOLOv3(106 layers of YOLOv3).

Does this mean YOLOv4 incorporates the entire YOLOv3?

Solution

First, what is YOLOv3 composed of?

YOLOv3 is composed of two parts:

Backbone or Feature Extractor --> Darknet53
Head or Detection Blocks --> 53 layers

The head performs (1) bounding box localization and (2) classification of the object inside each box.
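
As a rough illustration of those two jobs, here is a minimal sketch of how large the prediction from a YOLOv3-style head is. The function names are hypothetical, and the defaults (3 anchors per scale, strides of 32/16/8) are the commonly used YOLOv3 settings, assumed here rather than taken from the text above.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels per grid cell for a YOLOv3-style head (assumed layout)."""
    # Per anchor: 4 box coordinates + 1 objectness score + one score per class.
    return anchors_per_scale * (4 + 1 + num_classes)


def grid_sizes(input_size, strides=(32, 16, 8)):
    """Side length of the prediction grid at each detection scale."""
    return [input_size // s for s in strides]


# Example with assumed settings: 80 classes (COCO) and a 416x416 input.
print(head_channels(80))   # channels per grid cell
print(grid_sizes(416))     # grid side length at each of the three scales
```

The box coordinates and objectness score cover localization, and the per-class scores cover classification.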

YOLOv4 reuses this same head from YOLOv3. So no, it does not contain the entire 106-layer YOLOv3 network; only the detection head is carried over.

To summarize, YOLOv4 has three main parts:

Backbone --> CSPDarknet53
Neck (Connects the backbone with the head) --> SPP, PAN
Head --> YOLOv3's Head
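
The composition above can be sketched as plain data to make the difference between the two models explicit. This is only an illustrative sketch: the `Detector` class and its field names are hypothetical, and YOLOv3 is given an empty neck simply to follow the two-part description in this answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    """Hypothetical description of a detector as backbone, neck, and head."""
    backbone: str
    neck: tuple
    head: str

    def pipeline(self):
        # Stages in the order the feature maps flow through them.
        return " --> ".join([self.backbone, *self.neck, self.head])


# Empty neck here only mirrors the two-part description of YOLOv3 above.
YOLOV3 = Detector(backbone="Darknet53", neck=(), head="YOLOv3 head")
YOLOV4 = Detector(backbone="CSPDarknet53", neck=("SPP", "PAN"), head="YOLOv3 head")

print(YOLOV4.pipeline())
print(YOLOV4.head == YOLOV3.head)          # the head is shared
print(YOLOV4.backbone == YOLOV3.backbone)  # the backbone is not
```

Only the last stage is common to both models, which is why the pipeline in the question should end with the YOLOv3 head rather than all of YOLOv3.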

References:

Section 1.A. in https://ieeexplore.ieee.org/document/9214094
Page 5 of http://arxiv.org/abs/2004.10934