I have already implemented image captioning using VGG as the image classification model. I have read that YOLO is a fast image classification and detection model, and that it is primarily used for multiple object detection. However, for image captioning I just want the classes, not the bounding boxes.
I completely agree with what Parag S. Chandakkar mentioned in his answer. YOLO and R-CNN, the two most widely used object detection models, are slow compared to VGG-16 and other object classification networks if used just for classification. However, in support of YOLO, I would mention that you can create a single model for both image captioning and object detection.
YOLO (v1) generates a flat vector of length 1470: a 7×7 grid where each cell predicts 2 bounding boxes (5 values each) plus 20 class probabilities, i.e. 7 × 7 × (2 × 5 + 20) = 1470.
Tune YOLO's output layer to the number of classes in your dataset, i.e. make YOLO generate a vector of length 49 × (number of classes in your dataset) + 98 + 392 (class probabilities + box confidences + box coordinates).
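For concreteness, here is a minimal NumPy sketch of that size computation and of slicing the flat output into its three parts. It assumes the original YOLOv1 ordering (class probabilities first, then box confidences, then box coordinates); the variable names are mine:

```python
import numpy as np

S, B = 7, 2   # YOLOv1 grid size and boxes per cell
C = 20        # number of classes (20 for PASCAL VOC gives length 1470)

# 49*C class probabilities + 98 confidences + 392 box coordinates
output_len = S * S * C + S * S * B + S * S * B * 4
print(output_len)  # 1470 when C == 20

y = np.random.rand(output_len)  # stand-in for the network output

class_probs = y[:S * S * C].reshape(S, S, C)             # per-cell class probabilities
confidences = y[S * S * C:S * S * (C + B)].reshape(S, S, B)
boxes = y[S * S * (C + B):].reshape(S, S, B, 4)          # (x, y, w, h) per box
```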
Use this vector to decode the bounding boxes.
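A rough decoding sketch, reusing the slices from the snippet above. The threshold value and the convention of cell-relative (x, y) and image-relative (w, h) follow the YOLOv1 paper; treat this as illustrative, not a drop-in implementation:

```python
import numpy as np

def decode_boxes(class_probs, confidences, boxes, threshold=0.2):
    """Return (class_id, score, x, y, w, h) tuples in relative image coordinates."""
    S, _, B = confidences.shape
    detections = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                # class-specific confidence = box confidence * class probability
                scores = confidences[row, col, b] * class_probs[row, col]
                class_id = int(np.argmax(scores))
                score = float(scores[class_id])
                if score < threshold:
                    continue
                x, y, w, h = boxes[row, col, b]
                # (x, y) are offsets within the cell; (w, h) are relative to the image
                x = (col + x) / S
                y = (row + y) / S
                detections.append((class_id, score, x, y, w, h))
    return detections
```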
Thus, to sum up, you can generate the bounding boxes first and then further process that same vector to generate captions.
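And since you only need the classes for captioning, you can ignore the box coordinates entirely and aggregate the per-cell class probabilities into a single label set. Here is one possible sketch; the confidence weighting and the 0.3 threshold are my own choices, not part of YOLO:

```python
import numpy as np

def image_classes(class_probs, confidences, threshold=0.3):
    """Collapse the SxS grid to the set of class ids present in the image."""
    # weight each cell's class probabilities by its best box confidence
    cell_conf = confidences.max(axis=-1, keepdims=True)  # (S, S, 1)
    weighted = class_probs * cell_conf                   # (S, S, C)
    per_class = weighted.max(axis=(0, 1))                # strongest evidence per class
    return np.flatnonzero(per_class > threshold).tolist()
```

The resulting class ids (or the pooled per-class vector itself) can then be fed to your captioning decoder in place of the VGG features.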