I am using Apache Tika to extract the content of uploaded files, and I do not want to parse files that contain embedded images. At the moment I use ToXMLContentHandler and look for an <img> tag in the output.
val parser = new AutoDetectParser()
val handler = new ToXMLContentHandler()
val metaData = new Metadata
parser.parse(stream, handler, metaData, getParseContext)
val xmlFileContent = XML.loadString(handler.toString)
val isDocHasImg = (xmlFileContent \\ "body" \\ "img").toList.nonEmpty
Is there a better way to achieve this? I am using Scala.
If anyone is looking for a solution, you can implement the EmbeddedDocumentExtractor interface:
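import java.io.InputStream
import org.apache.tika.extractor.EmbeddedDocumentExtractor
import org.apache.tika.metadata.Metadata
import org.xml.sax.ContentHandler

class EmbeddedImageFinder extends EmbeddedDocumentExtractor {
  // Set to true once any embedded resource reports an image MIME type.
  var isImageExists = false
  override def shouldParseEmbedded(metadata: Metadata): Boolean = {
    // Content-Type can be missing, so wrap the lookup to avoid a NullPointerException.
    if (Option(metadata.get("Content-Type")).exists(_.contains("image/"))) {
      isImageExists = true
    }
    false // never parse the embedded resource itself
  }
  override def parseEmbedded(stream: InputStream, handler: ContentHandler,
                             metadata: Metadata, outputHtml: Boolean): Unit = {}
}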
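Then register an instance on the ParseContext that you pass to parser.parse, keeping a reference to it so that you can read isImageExists once parsing has finished:
val imageFinder = new EmbeddedImageFinder
context.set(classOf[EmbeddedDocumentExtractor], imageFinder)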
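
For anyone who wants the whole flow in one place, here is a minimal end-to-end sketch. It is only an illustration under some assumptions: the names ImageDetectingExtractor, imageFound, EmbeddedImageCheck and hasEmbeddedImage are made up for this example, and it uses startsWith rather than contains for the MIME check. A no-op DefaultHandler is passed as the content handler because only the extractor callbacks matter here, not the extracted text.

```scala
import java.io.InputStream

import org.apache.tika.extractor.EmbeddedDocumentExtractor
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.xml.sax.ContentHandler
import org.xml.sax.helpers.DefaultHandler

// Hypothetical name: remembers whether any embedded resource reported an image MIME type.
final class ImageDetectingExtractor extends EmbeddedDocumentExtractor {
  private var found = false
  def imageFound: Boolean = found

  override def shouldParseEmbedded(metadata: Metadata): Boolean = {
    // Content-Type may be absent on embedded metadata, so guard against null.
    if (Option(metadata.get("Content-Type")).exists(_.startsWith("image/"))) found = true
    false // never descend into the embedded resource
  }

  // Never reached in practice, because shouldParseEmbedded always returns false.
  override def parseEmbedded(stream: InputStream, handler: ContentHandler,
                             metadata: Metadata, outputHtml: Boolean): Unit = ()
}

object EmbeddedImageCheck {
  // Hypothetical helper: parses the stream once and reports what the extractor saw.
  def hasEmbeddedImage(stream: InputStream): Boolean = {
    val finder = new ImageDetectingExtractor
    val context = new ParseContext()
    context.set(classOf[EmbeddedDocumentExtractor], finder)
    // DefaultHandler discards all SAX events.
    new AutoDetectParser().parse(stream, new DefaultHandler(), new Metadata(), context)
    finder.imageFound
  }
}
```

A new extractor is created per call, so the flag never leaks from one uploaded file to the next. Whether a given parser reports its images through this callback, and whether it fills in Content-Type before calling it, depends on the parser and its configuration, so it is worth checking against the file types you actually accept.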