I parse files with the great Apache Tika library. I want to extract the meta tags with my own parser and then get only the content of the <body> tag as HTML and store it in a database.
I have tried this for hours/days now :-(, but cannot find a solution:

ToHTMLContentHandler: after the <body> tag I get exceptions about an invalid namespace without the <html> tag.
BodyContentHandler: just returns the body text without HTML tags.
tika-app: seems to use a TransformerHandler to get HTML (I have never heard of this kind of handler before).

Can I use this to get just the HTML from the <body> tag and parse the meta tags myself? Is this a better way than using a ToHTMLContentHandler?

Check whether the following links help you a bit:
Content Detection, Metadata and Content Extraction with Apache Tika
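
To make the TransformerHandler idea from the question concrete, here is a minimal sketch that uses only JDK classes. A TransformerHandler is a SAX ContentHandler that serializes the events it receives, so a small filter in front of it can drop everything outside <body>. The class name BodyHtmlSketch, the helper BodyOnlyHandler and the method bodyHtml are hypothetical names made up for this illustration, and the sample XHTML is fed in through the JDK's SAX parser rather than through Tika. With Tika, the assumption is that you would pass the same BodyOnlyHandler as the ContentHandler argument of the parser's parse(stream, handler, metadata, context) call instead; I have not verified that against every Tika version.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class BodyHtmlSketch {

    /** Hypothetical helper: forwards only the SAX events nested inside <body>. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int depth = 0; // 0 = outside <body>, 1 = at <body>, >1 = nested inside it

        BodyOnlyHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startDocument() throws SAXException {
            delegate.startDocument();
        }

        @Override
        public void endDocument() throws SAXException {
            delegate.endDocument();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0) {
                depth++;
                // Drop the XHTML namespace so the serializer writes plain HTML elements.
                delegate.startElement("", localName, localName, atts);
            } else if ("body".equals(localName)) {
                depth = 1; // the <body> element itself is not forwarded
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 1) {
                delegate.endElement("", localName, localName);
                depth--;
            } else if (depth == 1) {
                depth = 0; // closing </body>
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth > 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Serializes the children of <body> in the given XHTML string as HTML. */
    static String bodyHtml(String xhtml) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        // Set the output properties before attaching the result.
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        serializer.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        serializer.setResult(new StreamResult(out));

        // Stand-in for Tika: a namespace-aware SAX parser produces the events here.
        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        XMLReader reader = pf.newSAXParser().getXMLReader();
        reader.setContentHandler(new BodyOnlyHandler(serializer));
        reader.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello <b>Tika</b></p></body></html>";
        System.out.println(bodyHtml(xhtml));
    }
}
```

The filter strips the namespace URI on purpose: the <html> element that would normally carry the XHTML namespace declaration is never forwarded, which may be related to the namespace exceptions described in the question. Another option that is often suggested, and which I believe exists in Tika, is wrapping one handler in another, as in new BodyContentHandler(new ToXMLContentHandler()); treat that as something to check against the Javadoc of your Tika version.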