Detect an embedded image in any document using Apache Tika?


I am using Apache Tika to extract the content of uploaded files, and I do not want to parse files that contain embedded images. At the moment I use ToXMLContentHandler and look for an <img> tag in the output.

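    import org.apache.tika.metadata.Metadata
    import org.apache.tika.parser.AutoDetectParser
    import org.apache.tika.sax.ToXMLContentHandler
    import scala.xml.XML

    val parser   = new AutoDetectParser()
    val handler  = new ToXMLContentHandler()
    val metaData = new Metadata
    parser.parse(stream, handler, metaData, getParseContext)

    // Load the XHTML output and look for any <img> element under <body>
    val xmlFileContent = XML.loadString(handler.toString)
    val isDocHasImg    = (xmlFileContent \\ "body" \\ "img").nonEmpty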

Is there a better way to achieve this? I am using Scala.


Solution

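  • If anyone is looking for a solution, you can implement the EmbeddedDocumentExtractor interface. Tika calls shouldParseEmbedded with the metadata of each embedded resource, so you can check its content type there without parsing the resource itself.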

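    import java.io.InputStream
    import org.apache.tika.extractor.EmbeddedDocumentExtractor
    import org.apache.tika.metadata.Metadata
    import org.xml.sax.ContentHandler

    class EmbeddedImageFinder() extends EmbeddedDocumentExtractor {
      // Becomes true once an embedded resource reports an image/* content type
      var isImageExists: Boolean = false

      override def shouldParseEmbedded(metadata: Metadata): Boolean = {
        // metadata.get returns null when Content-Type is missing, so wrap it in Option
        if (Option(metadata.get("Content-Type")).exists(_.contains("image/"))) {
          isImageExists = true
        }
        false // never parse the embedded resource itself
      }

      // Intentionally empty: embedded resources are only inspected, not parsed
      override def parseEmbedded(stream: InputStream, handler: ContentHandler,
                                 metadata: Metadata, outputHtml: Boolean): Unit = {}
    }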
    
    

    Then register it in the ParseContext that you pass to the parser:

    context.set(classOf[EmbeddedDocumentExtractor], new EmbeddedImageFinder)
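
    To read the result after parsing, you need access to the flag that the extractor sets. Here is a minimal sketch of one way to wire it together. The names EmbeddedImageCheck, containsEmbeddedImage and imageSeen are hypothetical, and it uses a DefaultHandler because the extracted text is not needed for this check. Whether Content-Type is already populated when shouldParseEmbedded is called depends on the parser for the container format, so treat this as a starting point rather than a guaranteed detector.

```scala
import java.io.InputStream

import org.apache.tika.extractor.EmbeddedDocumentExtractor
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.xml.sax.ContentHandler
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper object, not part of Tika
object EmbeddedImageCheck {

  // Returns true if Tika reports at least one embedded resource
  // whose Content-Type starts with "image/".
  def containsEmbeddedImage(stream: InputStream): Boolean = {
    var imageSeen = false

    // Anonymous extractor that only inspects metadata and never parses
    val finder = new EmbeddedDocumentExtractor {
      override def shouldParseEmbedded(metadata: Metadata): Boolean = {
        // Content-Type may be missing (null), hence the Option wrapper
        if (Option(metadata.get("Content-Type")).exists(_.startsWith("image/"))) {
          imageSeen = true
        }
        false
      }

      override def parseEmbedded(stream: InputStream, handler: ContentHandler,
                                 metadata: Metadata, outputHtml: Boolean): Unit = ()
    }

    val context = new ParseContext()
    context.set(classOf[EmbeddedDocumentExtractor], finder)

    // DefaultHandler discards all SAX events; only the flag matters here
    new AutoDetectParser().parse(stream, new DefaultHandler(), new Metadata(), context)
    imageSeen
  }
}
```

    Because the flag lives inside the helper, each call starts from a clean state, which avoids sharing one mutable extractor between uploads.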