By extending org.xml.sax.helpers.DefaultHandler and using it as the ContentHandler for Tika, I can parse an HTML file and read any tag and its attributes by overriding startElement. In general this works wonderfully (very good performance, and it can handle large files). However, for the base tag
<base href="http://www.w3schools.com/images/" target="_blank">
the attributes are always null. The attributes of every other tag are read correctly. Why would that be?
@Override
public void startElement(String uri, String local, String name, Attributes attributes) {
    if (XHTML.equals(uri)) {
        if ("base".equals(local)) {
            String href = attributes.getValue("", "href"); // always null
            System.out.println("base href: " + href);
        }
    }
}
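It is not possible to get the attributes of the base tag through Tika, because of a known bug: "XHTMLContentHandler doesn't pass attributes of html element". The handler code in the question is fine; the attributes are dropped before they reach startElement.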
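To confirm that the handler logic itself is not at fault, here is a minimal sketch that runs the same startElement check through the JDK's own SAX parser instead of Tika. The class name BaseHrefDemo and the helper extractBaseHref are made up for this illustration, and the sample document is an assumption: it has to be well-formed XHTML with the XHTML namespace declared, so this is a sanity check rather than a replacement for Tika on real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
public class BaseHrefDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Returns the href of the first <base> element, or null if there is none.
    static String extractBaseHref(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so uri and local are populated
        SAXParser parser = factory.newSAXParser();

        final String[] href = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String name, Attributes attributes) {
                // Same check as in the question.
                if (XHTML.equals(uri) && "base".equals(local) && href[0] == null) {
                    href[0] = attributes.getValue("", "href");
                }
            }
        };
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return href[0];
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XHTML version of the tag from the question (note the self-closing base).
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<base href=\"http://www.w3schools.com/images/\" target=\"_blank\"/>"
                + "</head><body/></html>";
        System.out.println("base href: " + extractBaseHref(doc));
    }
}
```

With a plain SAX parser the href comes through, which points at Tika's XHTMLContentHandler rather than the startElement override as the place where the attributes get lost.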