I am currently using the SAX interface of the libxml library to parse a large number (around 60,000) of XML documents, each less than 1 MB in size. I chose SAX because I thought it would be the most efficient. Would there be much of a performance difference in this use case compared with, say, a DOM parser?
Also, in my current approach I have an enum with a large number of states, which I use in a switch statement in my startElement/endElement handlers. The number of states is growing quite large and becoming unmanageable. Is there a better way to handle this in libxml? For example, I've noticed that some Java libraries let you create multiple handler instances, so that when you enter a certain element you can delegate handling of that element's subtree to a dedicated handler.
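The delegation pattern I have in mind looks roughly like this. I've sketched it in Python using the stdlib xml.sax module for brevity (the class names and sample XML are made up), but the same stack-of-handlers structure should map onto libxml2's startElement/endElement callbacks:

```python
import io
import xml.sax

class DelegatingHandler(xml.sax.ContentHandler):
    """Root handler keeps a stack of sub-handlers; every SAX event is
    routed to whichever sub-handler is currently on top of the stack."""
    def __init__(self):
        self.stack = []   # active sub-handlers
        self.books = []   # collected results

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1].startElement(name, attrs)
        elif name == "book":
            # entering a <book>: delegate everything inside it
            handler = BookHandler(self)
            self.stack.append(handler)
            handler.startElement(name, attrs)

    def endElement(self, name):
        if self.stack:
            self.stack[-1].endElement(name)

    def characters(self, content):
        if self.stack:
            self.stack[-1].characters(content)

class BookHandler:
    """Sub-handler that only knows about the contents of one <book>,
    so it needs only a few local states instead of a global enum."""
    def __init__(self, parent):
        self.parent = parent
        self.depth = 0
        self.in_title = False
        self.title = []

    def startElement(self, name, attrs):
        self.depth += 1
        self.in_title = (name == "title")

    def characters(self, content):
        if self.in_title:
            self.title.append(content)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
        self.depth -= 1
        if self.depth == 0:
            # leaving </book>: report the result and pop ourselves
            self.parent.books.append("".join(self.title))
            self.parent.stack.pop()

xml_doc = ("<library><book><title>SAX in Practice</title></book>"
           "<book><title>DOM for Beginners</title></book></library>")
handler = DelegatingHandler()
xml.sax.parse(io.StringIO(xml_doc), handler)
print(handler.books)  # prints ['SAX in Practice', 'DOM for Beginners']
```

Each sub-handler tracks its own depth so it knows when its element closes and it can pop itself off the stack; this keeps the per-element state local instead of accumulating in one giant switch.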
When you say "efficient", I guess you are talking about machine efficiency? But programmer efficiency is much more important, and as you've discovered, writing SAX applications to process complex XML requires a lot of complex code that is hard to develop and hard to debug.
You haven't said what the output of your processing should be. By default, I would start by writing the processing in the most programmer-efficient language available, typically XQuery or XSLT, and only drop down to a lower-level language if you can't meet your performance requirements that way.