
How to get the href attribute from the base tag using a Tika SAX ContentHandler?


By extending org.xml.sax.helpers.DefaultHandler and passing it to Tika as the ContentHandler, I can parse an HTML file and read any tag and its attributes by overriding startElement. In general this works wonderfully (very good performance, and it handles large files). However, for the base tag

<base href="http://www.w3schools.com/images/" target="_blank">

the attribute values are always null. The attributes of all other tags come through correctly. Why would that be?

@Override
public void startElement(String uri, String local, String name, Attributes attributes) {
  // XHTML is assumed to hold the XHTML namespace URI, "http://www.w3.org/1999/xhtml"
  if (XHTML.equals(uri) && "base".equals(local)) {
    String href = attributes.getValue("", "href"); // always null
    System.out.println("base href: " + href);
  }
}

Solution

  • It is not possible to get attributes from the base tag because of a known bug in Tika: "XHTMLContentHandler doesn't pass attributes of html element".
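
To check that the handler logic itself is not at fault, the same startElement code can be run without Tika. The sketch below is only an illustration, not a fix for the Tika bug: it feeds a small, well-formed XHTML string to the JDK's own namespace-aware SAX parser. The class name BaseHrefCheck, the helper findBaseHref, and the sample document are made up for this example, and the XHTML constant is assumed to be the standard XHTML namespace URI.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BaseHrefCheck {
    // Assumption: the XHTML constant from the question is the XHTML namespace URI.
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Same check as in the question, but the href is stored instead of printed. */
    static class BaseHandler extends DefaultHandler {
        String href;

        @Override
        public void startElement(String uri, String local, String name, Attributes attributes) {
            if (XHTML.equals(uri) && "base".equals(local)) {
                href = attributes.getValue("", "href");
            }
        }
    }

    /** Parses a well-formed XHTML string with the JDK SAX parser (no Tika involved). */
    static String findBaseHref(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required so uri and local name are populated
            BaseHandler handler = new BaseHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.href;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the base tag is self-closed so the XML is well formed.
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<base href=\"http://www.w3schools.com/images/\" target=\"_blank\"/>"
                + "</head><body/></html>";
        System.out.println("base href: " + findBaseHref(doc));
    }
}
```

With a plain SAX parser the href is delivered to the handler, which suggests the attributes are lost before the events reach the handler, consistent with the bug named above.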