Tags: java, xml-parsing, apache-tika

Parse meta tag and get HTML content from body with Tika


I parse files with the great Apache Tika library. I want to extract the meta tags with my own parser and then get only the content of the <body> tag, as HTML, so that I can store it in a database.

I have been trying this for hours/days :-( but cannot find a solution:

  • When I start using the ToHTMLContentHandler only after the <body> tag, i.e. without the <html> tag, I get exceptions about an invalid namespace.
  • BodyContentHandler just returns the body text without HTML tags.
  • The tika-app seems to use a TransformerHandler to get HTML (I had never heard of this kind of handler before). Can I use it to get just the HTML from the <body> tag and parse the meta tags myself? Is that a better way than using a ToHTMLContentHandler?

Solution

  • See whether the following links help:

    Content Detection, Metadata and Content Extraction with Apache Tika

    Parsing HTML with Apache Tika
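
Since the links alone do not show code, here is a minimal sketch of the general idea behind the question: one SAX ContentHandler that collects the <meta> tags itself and re-serializes only the events inside <body> as HTML. It uses nothing but the JDK, so the JDK SAX parser stands in for Tika here, and the serializer is hand-rolled rather than Tika's ToHTMLContentHandler. The class name BodyHtmlExtractor and its methods are made up for illustration. Tika parsers report their output as XHTML SAX events to whatever ContentHandler you pass to parse(), so a handler along these lines could presumably be handed to Tika instead, but I have not verified that against a particular Tika version.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects <meta name=... content=...> pairs and
// rebuilds the markup found inside <body> (without the <body> tag itself).
class BodyHtmlExtractor extends DefaultHandler {
    // HTML elements that must not get a closing tag (partial list).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input"));

    final Map<String, String> meta = new LinkedHashMap<>();
    private final StringBuilder html = new StringBuilder();
    private int depth = 0; // 0 = outside <body>, 1 = directly inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            // Inside <body>: write the opening tag with its attributes.
            html.append('<').append(localName);
            for (int i = 0; i < atts.getLength(); i++) {
                html.append(' ').append(atts.getLocalName(i))
                    .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            html.append('>');
            depth++;
        } else if ("body".equals(localName)) {
            depth = 1;
        } else if ("meta".equals(localName) && atts.getValue("", "name") != null) {
            // Outside <body>: remember the meta tag instead of serializing it.
            meta.put(atts.getValue("", "name"), atts.getValue("", "content"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 1 && !VOID.contains(localName)) {
            html.append("</").append(localName).append('>');
        }
        if (depth > 0) {
            depth--; // leaving a body descendant, or <body> itself
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            html.append(escape(new String(ch, start, length)));
        }
    }

    String bodyHtml() {
        return html.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Runs the handler over an XHTML string with the JDK's namespace-aware SAX parser.
    static BodyHtmlExtractor extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        BodyHtmlExtractor handler = new BodyHtmlExtractor();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        BodyHtmlExtractor result = extract(
                "<html><head><meta name=\"author\" content=\"Jane\"/></head>"
                + "<body><p>Hello <b>world</b></p></body></html>");
        System.out.println(result.meta);
        System.out.println(result.bodyHtml());
    }
}
```

Because the handler writes local names only, the namespace of the incoming events never reaches the output, which sidesteps the namespace complaint at the cost of dropping any prefixed elements. It may also be worth trying Tika's own classes first: BodyContentHandler has a constructor that takes another ContentHandler, so wrapping a ToXMLContentHandler in it could give the body as markup without any custom serializer.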