I am working on an image-captioning tool and came across Apache Tika. I would like to know which model it uses internally for captioning, and how good its results are compared with the current state of the art.
Following the Apache Tika changelog, I came across this feature request for image captioning. There, the author states that they used Google's
'Show and Tell' neural network,
described in this blog post.
Also, here is a link to the paper if you want to compare it against current state-of-the-art methods.
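
If it helps, here is a minimal sketch of how you might call the captioning parser through the standard Tika facade. It assumes a `tika-config.xml` that enables the image-captioning parser (the exact `<parser>` block is documented on the Tika wiki and varies by version); the file names and the metadata-dumping approach below are illustrative assumptions, not the definitive setup.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;

public class CaptionDemo {
    public static void main(String[] args) throws Exception {
        // Assumed: a tika-config.xml in the working directory that registers
        // the image-captioning parser (see the Tika wiki for the exact block).
        TikaConfig config = new TikaConfig(new File("tika-config.xml"));
        Tika tika = new Tika(config);

        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream("photo.jpg")) {
            // parseToString runs the configured parsers to completion
            // and populates the Metadata object as a side effect.
            tika.parseToString(stream, metadata);
        }

        // The captioning parser writes generated captions into the metadata;
        // the exact key name varies by version, so dump everything.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + String.join("; ", metadata.getValues(name)));
        }
    }
}
```

Note that the model itself is not bundled with Tika; the parser talks to a separately running TensorFlow service, so you would need that service up before the captions appear in the metadata.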