I have already implemented image captioning using VGG as the image classification model. I have read that YOLO is a fast image classification and detection model, and that it is primarily used for multiple object detection. However, for image captioning I just want the classes, not the bounding boxes.
I completely agree with what Parag S. Chandakkar mentioned in his answer. YOLO and R-CNN, the two most widely used object detection models, are slow compared to VGG-16 and other object classification networks if used just for classification. However, in support of YOLO, I would mention that you can create a single model for both image captioning and object detection.
YOLO (v1) generates a flat vector of length 1470: a 7×7 grid where each cell predicts 2 bounding boxes (5 values each) plus 20 class probabilities, i.e. 7 × 7 × (2 × 5 + 20) = 1470.
Tune YOLO's output layer to the number of classes in your dataset, i.e. make YOLO generate a vector of length 49 × (number of classes in your dataset) + 98 + 392 (class probabilities + box confidences + box coordinates).
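For concreteness, here is a minimal NumPy sketch of that size computation and of slicing the flat output into its three parts. It assumes the original YOLOv1 ordering (class probabilities first, then box confidences, then box coordinates); the variable names are mine:

```python
import numpy as np

S, B = 7, 2   # YOLOv1 grid size and boxes per cell
C = 20        # number of classes (20 for PASCAL VOC gives length 1470)

# 49*C class probabilities + 98 confidences + 392 box coordinates
output_len = S * S * C + S * S * B + S * S * B * 4
print(output_len)  # 1470 when C == 20

y = np.random.rand(output_len)  # stand-in for the network output

class_probs = y[:S * S * C].reshape(S, S, C)             # per-cell class probabilities
confidences = y[S * S * C:S * S * (C + B)].reshape(S, S, B)
boxes = y[S * S * (C + B):].reshape(S, S, B, 4)          # (x, y, w, h) per box
```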
Use this vector to decode the bounding boxes.
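A rough decoding sketch, reusing the slices from the snippet above. The threshold value and the convention of cell-relative (x, y) and image-relative (w, h) follow the YOLOv1 paper; treat this as illustrative, not a drop-in implementation:

```python
import numpy as np

def decode_boxes(class_probs, confidences, boxes, threshold=0.2):
    """Return (class_id, score, x, y, w, h) tuples in relative image coordinates."""
    S, _, B = confidences.shape
    detections = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                # class-specific confidence = box confidence * class probability
                scores = confidences[row, col, b] * class_probs[row, col]
                class_id = int(np.argmax(scores))
                score = float(scores[class_id])
                if score < threshold:
                    continue
                x, y, w, h = boxes[row, col, b]
                # (x, y) are offsets within the cell; (w, h) are relative to the image
                x = (col + x) / S
                y = (row + y) / S
                detections.append((class_id, score, x, y, w, h))
    return detections
```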
Thus, to sum up, you can generate the bounding boxes first and then further process that same vector to generate captions.
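And since you only need the classes for captioning, you can ignore the box coordinates entirely and aggregate the per-cell class probabilities into a single label set. Here is one possible sketch; the confidence weighting and the 0.3 threshold are my own choices, not part of YOLO:

```python
import numpy as np

def image_classes(class_probs, confidences, threshold=0.3):
    """Collapse the SxS grid to the set of class ids present in the image."""
    # weight each cell's class probabilities by its best box confidence
    cell_conf = confidences.max(axis=-1, keepdims=True)  # (S, S, 1)
    weighted = class_probs * cell_conf                   # (S, S, C)
    per_class = weighted.max(axis=(0, 1))                # strongest evidence per class
    return np.flatnonzero(per_class > threshold).tolist()
```

The resulting class ids (or the pooled per-class vector itself) can then be fed to your captioning decoder in place of the VGG features.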