Tags: java, xml, dom, sax, jaxp

What is the advantage of using JAXP instead of DOM / SAX directly in Java?


Being new to XML parsing, I'm trying to understand the different technologies. There is a confusing number of them, each aimed at different needs:

  • W3C-DOM
  • XOM
  • jDom
  • JAXP
  • JAXB
  • DOM
  • SAX
  • StAX
  • TrAX
  • Woodstox
  • dom4j
  • Crimson
  • VTD-XML
  • Xerces-J
  • Castor
  • XStream
  • ...

Just to name a few.

DOM and SAX seem to be low-level ways of parsing and working with XML, so I decided to focus on the low-level ones that get mentioned the most in different sources:

DOM, SAX, JAXP.

I've read about parsers in general here on Stack Overflow, the JAXP tutorial from Oracle, articles on XML parsing in general, and so on.

I've also tried some tutorials, like this German one and others.

I'm starting to grasp DOM and SAX now, but the reason to use JAXP is still beyond me. It seems to be more of an interface that uses DOM, SAX, ... internally, but why not use DOM or SAX directly?

What is the advantage of using JAXP, in layman's terms?


Solution

  • (Although you haven't said so explicitly, your question seems to relate exclusively to the Java world, and this answer reflects that.)

    JAXP is a set of interfaces covering XML parsing, XSLT transformation, and XML Schema validation. If we focus just on the XML parsing side, its main contribution is a mechanism for locating an XML parser implementation, so your source code isn't locked into a particular product. Frankly, that's of limited value these days: the only two SAX/DOM parsers in common use are the one embedded in the JDK and Apache Xerces. Apache Xerces is better in every respect except that you need to download it separately.
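
    As a rough illustration of that lookup mechanism, here is a minimal sketch using the standard `javax.xml.parsers` API. The class name `JaxpLookupSketch` and the sample document are made up for this example. Note that no parser product is named anywhere in the code: `DocumentBuilderFactory.newInstance()` picks the implementation at run time.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;

    // Hypothetical class name, used only for this illustration.
    final class JaxpLookupSketch {

        // Parses an XML string into a DOM tree without naming a parser class.
        static Document parse(String xml) {
            try {
                // newInstance() locates an implementation at run time: the JDK's
                // built-in parser by default, or another one (for example Xerces)
                // if it is configured or available on the classpath.
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true); // namespace awareness is off by default
                DocumentBuilder builder = factory.newDocumentBuilder();
                byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
                return builder.parse(new ByteArrayInputStream(bytes));
            } catch (ParserConfigurationException | SAXException | IOException e) {
                // Wrapped in an unchecked exception to keep the sketch short.
                throw new IllegalStateException("could not parse sample document", e);
            }
        }

        public static void main(String[] args) {
            Document doc = parse("<greeting lang=\"en\">hello</greeting>");
            System.out.println(doc.getDocumentElement().getTagName());
            System.out.println(doc.getDocumentElement().getTextContent());
        }
    }
    ```

    Swapping the JDK parser for Xerces should then be a deployment decision (classpath or the `javax.xml.parsers.DocumentBuilderFactory` system property) rather than a code change.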

    As for the other parsing interfaces, they break down into two categories: event-based APIs and tree-based APIs. Tree-based APIs are much easier to work with, but can use a lot of memory when handling large documents.

    The two dominant event-based APIs are SAX (push) and StAX (pull). Many programmers find pull parsing easier because you can use the program stack to maintain state information. Unfortunately, the StAX API is a bit buggy, and different implementations have fixed its gaps in different ways. The most complete and reliable implementation of StAX is the Woodstox parser; the most complete and reliable implementation of SAX is Apache Xerces. But don't attempt an event-based parsing approach unless your application really needs that level of performance (and unless you have the experience needed to avoid losing all the performance gains at the application level).
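
    To make the push/pull difference concrete, here is a small sketch that collects the text of every `<title>` element twice, once with SAX and once with StAX, using only the APIs bundled with the JDK. The class name, method names, and sample document are invented for this example. In the SAX version the parser calls the handler, so state has to live in a field; in the StAX version the code drives the parser, so a plain loop is enough.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical class name, used only for this illustration.
    final class PushVsPullSketch {

        // SAX (push): the parser calls the handler, so state lives in a field.
        static List<String> titlesWithSax(String xml) {
            List<String> titles = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                private StringBuilder text; // non-null only while inside <title>

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("title".equals(qName)) text = new StringBuilder();
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (text != null) text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("title".equals(qName)) {
                        titles.add(text.toString());
                        text = null;
                    }
                }
            };
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), handler);
            } catch (ParserConfigurationException | SAXException | IOException e) {
                throw new IllegalStateException(e);
            }
            return titles;
        }

        // StAX (pull): the code asks for the next event, so a loop is enough.
        static List<String> titlesWithStax(String xml) {
            List<String> titles = new ArrayList<>();
            try {
                XMLStreamReader reader = XMLInputFactory.newInstance()
                        .createXMLStreamReader(new StringReader(xml));
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // Reads the text and leaves the reader on </title>.
                        titles.add(reader.getElementText());
                    }
                }
                reader.close();
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            }
            return titles;
        }

        public static void main(String[] args) {
            String xml = "<books><book><title>First</title></book>"
                    + "<book><title>Second</title></book></books>";
            System.out.println(titlesWithSax(xml));
            System.out.println(titlesWithStax(xml));
        }
    }
    ```

    Both methods return the same list for this input; the difference is only in who drives the parse and where the "am I inside a title?" state is kept.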

    For tree-based APIs, the DOM remains dominant solely because it was defined by the W3C and is implemented in the JDK, and is therefore perceived as "standard"; it's also the one mentioned in all the books on the subject. However, of all the tree models, it is unquestionably the worst designed (mainly because it predates the introduction of namespaces). Alternatives include JDOM2, DOM4J, XOM, and AXIOM. I tend to recommend JDOM2 or XOM.
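
    One visible consequence of the DOM predating namespaces is that the API carries two parallel sets of methods, and the JAXP factory is namespace-unaware unless you ask otherwise. The following sketch parses the same document both ways; the class name `DomNamespaceSketch` and the `urn:example:books` namespace are invented for this example.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Element;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    // Hypothetical class name, used only for this illustration.
    final class DomNamespaceSketch {

        static final String NS = "urn:example:books"; // made-up namespace URI
        static final String XML =
                "<b:book xmlns:b=\"" + NS + "\"><b:title>XML basics</b:title></b:book>";

        // Parses the sample document and returns its root element.
        static Element root(boolean namespaceAware) {
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(namespaceAware);
                return factory.newDocumentBuilder()
                        .parse(new InputSource(new StringReader(XML)))
                        .getDocumentElement();
            } catch (ParserConfigurationException | SAXException | IOException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            // Original (pre-namespace) view: the prefix is just part of the name,
            // so lookups depend on whichever prefix the document happens to use.
            Element plain = root(false);
            System.out.println(plain.getTagName());
            System.out.println(plain.getElementsByTagName("b:title").getLength());

            // Namespace-aware view: a second family of *NS methods added later.
            Element aware = root(true);
            System.out.println(aware.getNamespaceURI());
            System.out.println(aware.getLocalName());
            System.out.println(aware.getElementsByTagNameNS(NS, "title").getLength());
        }
    }
    ```

    Libraries such as JDOM2 and XOM were designed after namespaces existed, so they expose a single namespace-aware model instead of this split.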