conv-neural-network object-detection yolo darknet

YOLOv3 SPP and YOLOv3 difference?

I couldn't find a good explanation of YOLOv3 SPP, which has a better mAP than YOLOv3. The author himself describes YOLOv3 SPP in his repo as:

YOLOv3 with spatial pyramid pooling, or something

But I still don't really understand it. In yolov3-spp.cfg I notice there are some additions:

### SPP ###
[maxpool]
stride=1
size=5

[route]
layers=-2

[maxpool]
stride=1
size=9

[route]
layers=-4

[maxpool]
stride=1
size=13

[route]
layers=-1,-3,-5,-6

### End SPP ###

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

Can anybody explain further how YOLOv3 SPP works? Why are layers -2, -4 and -1, -3, -5, -6 chosen in the [route] layers? Thanks.

Solution

Some researchers have since published a paper on applying SPP in YOLO: https://arxiv.org/abs/1903.08589

The differences between yolov3-tiny, yolov3, and yolov3-spp are:

yolov3-tiny.cfg downsamples (stride=2) in its Max-Pooling layers
yolov3.cfg downsamples (stride=2) in its Convolutional layers
yolov3-spp.cfg downsamples (stride=2) in its Convolutional layers and additionally uses stride=1 Max-Pooling layers (the SPP block) to pick out the strongest features at several window sizes
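
To make the [route] indices from the question concrete, here is a minimal Python sketch (not Darknet code) that traces the SPP block from the cfg excerpt. It assumes the layer just before the block outputs a 13x13 map with 512 channels, and that Darknet pads max-pool layers by size-1 by default, as I understand it. The helper names maxpool_out_size, resolve_route and trace are made up for illustration.

```python
# Sketch of the SPP block from the question's yolov3-spp.cfg excerpt.
# Index 0 stands for the layer immediately before "### SPP ###".

def maxpool_out_size(in_size, size, stride):
    # Assumed Darknet default: padding = size - 1, and
    # out = (in + padding - size) // stride + 1.
    padding = size - 1
    return (in_size + padding - size) // stride + 1


def resolve_route(position, offsets):
    # A negative entry in "layers=" counts back from the route layer itself.
    return [position + off for off in offsets]


# (type, params) in the order the layers appear in the cfg excerpt.
layers = [
    ("input", {}),                            # 0: layer before the SPP block
    ("maxpool", {"size": 5, "stride": 1}),    # 1
    ("route", {"layers": [-2]}),              # 2
    ("maxpool", {"size": 9, "stride": 1}),    # 3
    ("route", {"layers": [-4]}),              # 4
    ("maxpool", {"size": 13, "stride": 1}),   # 5
    ("route", {"layers": [-1, -3, -5, -6]}),  # 6
]

IN_CHANNELS = 512  # assumption, for illustration only
IN_SIZE = 13       # assumption: 416x416 input downsampled by 32


def trace(layers, in_channels, in_size):
    """Return (channels, spatial size) per layer and the resolved route sources."""
    shapes = []
    sources = {}
    for i, (kind, params) in enumerate(layers):
        if kind == "input":
            shapes.append((in_channels, in_size))
        elif kind == "maxpool":
            # A max-pool reads the layer directly before it.
            channels, size = shapes[i - 1]
            out = maxpool_out_size(size, params["size"], params["stride"])
            shapes.append((channels, out))
        elif kind == "route":
            src = resolve_route(i, params["layers"])
            sources[i] = src
            # Routes concatenate along channels, so spatial sizes must match.
            sizes = {shapes[j][1] for j in src}
            assert len(sizes) == 1, "route inputs must share a spatial size"
            shapes.append((sum(shapes[j][0] for j in src), sizes.pop()))
    return shapes, sources


if __name__ == "__main__":
    shapes, sources = trace(layers, IN_CHANNELS, IN_SIZE)
    for i, (kind, _) in enumerate(layers):
        print(i, kind, shapes[i], sources.get(i, ""))
```

Under these assumptions, route -2 and route -4 both point back to the layer before the block, so each max-pool (5, 9, 13) sees the same input rather than the output of the previous pool. With stride=1 the pools keep the spatial size, so the last route can concatenate the three pooled maps with the unpooled one, giving four times the channels at the same resolution. The following 1x1 convolution with filters=512 then reduces that again.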

However, they reached only mAP = 79.6% on the Pascal VOC 2007 test set using the YOLOv3-SPP model on the original framework.

We can achieve a higher accuracy, mAP = 82.1%, even with the yolov3.cfg model by using AlexeyAB's repository: https://github.com/AlexeyAB/darknet/issues/2557#issuecomment-474187706

And we should be able to achieve an even higher mAP with yolov3-spp.cfg using Alexey's repo.

Original GitHub question: https://github.com/AlexeyAB/darknet/issues/2859