Extracting an XML node (the complete XML, not just its text) along with other text nodes from an XML file using a SAX parser in Java


I have to read large XML files of roughly 500 MB each. A batch run typically processes 500 such files. I have to extract text nodes from them and, at the same time, extract whole XML nodes. I used XPath over DOM in Java for ease of use, but that doesn't work because of memory issues, as I have limited resources.

I intend to use SAX or StAX in Java now. The text nodes can be extracted easily, but I don't know how to extract XML nodes from the XML using SAX.

a sample:

<?xml version="1.0"?>
<Library>
  <Book name = "ABC">
    <Author>John</Author>
    <PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>    
    <AssocPrint>Oreilly</AssocPrint> </PrintingCompanyDT>
  </Book>
  <Book name = "123">
    <Author>Mason</Author>
    <PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>
    <AssocPrint>Oxford</AssocPrint> </PrintingCompanyDT>
  </Book>
</Library>

The expected result:

1) Book: ABC
Author: John
PrintCompany Detail XML:

<PrintingCompanyDT>
  <Printer>Sam</Printer>
  <Printmachine>Laser</Printmachine>
  <AssocPrint>Oreilly</AssocPrint> 
</PrintingCompanyDT>


2) Book: 123
Author: Mason
PrintCompany Detail XML:

<PrintingCompanyDT>
  <Printer>kelly</Printer>
  <Printmachine>DOTPrint</Printmachine>
  <AssocPrint>Oxford</AssocPrint>
</PrintingCompanyDT>


If I try the usual way of appending characters in the public void characters(char ch[], int start, int length) method, I get the following:

1) Book: ABC
Author: John
PrintCompany Detail XML:

Sam 
  Laser
      Oreilly

That is, only the text content and the whitespace, without the tags.

Can somebody suggest how to extract an XML node as is from an XML file with a SAX or StAX parser in Java?


Solution

  • I'd be tempted to use XOM for this sort of task rather than SAX or StAX directly. XOM is a tree-based representation similar to DOM or JDOM, but it has support for processing XML "twigs" in a kind of semi-streaming fashion, ideal for your kind of case where you have many similar elements that can be processed independently of one another. Also, every Node has a toXML method that returns the node serialized as an XML string.

    import java.io.File;

    import nu.xom.*;
    
    public class LibraryProcessor extends NodeFactory {
      private Nodes empty = new Nodes();
      private int bookNum = 0;
    
      /** Called for each closing tag in the XML */
      public Nodes finishMakingElement(Element element) {
        if("Book".equals(element.getLocalName())) {
          bookNum++;
          // process the complete Book element ...
          processBook(element);
          // ... and throw it away
          return empty;
        } else {
          // process other elements (except Book) in the normal way
          return super.finishMakingElement(element);
        }
      }
    
      private void processBook(Element book) {
        System.out.println(bookNum + ": " +
            book.getAttributeValue("name"));
        System.out.println("Author: " +
            book.getFirstChildElement("Author").getValue());
        System.out.println("PrintCompany Detail XML: " +
            book.getFirstChildElement("PrintingCompanyDT").toXML());
      }
    
      public static void main(String[] args) throws Exception {
        Builder builder = new Builder(new LibraryProcessor());
        builder.build(new File(args[0]));
      }
    }
    

    This will work its way through the XML document, calling processBook once for each Book element in turn. Within processBook you have access to the whole Book XML tree as XOM nodes, but without having to load the entire file into memory in one go - the best of both worlds. The "Factories, Filters, Subclassing, and Streaming" section of the XOM tutorial has more detail on this technique.

    This example just shows the most basic bits of the XOM API, but it also provides powerful XPath support if you need to do more complex processing. For example, you can directly access the Printmachine element within processBook (XPath names are case-sensitive, so the name must match the sample exactly) using

    Element machine = (Element)book.query("PrintingCompanyDT/Printmachine").get(0);
    

    or if the structure is not so regular, for example if PrintingCompanyDT is sometimes a direct child of Book and sometimes deeper (e.g. a grandchild) then you can use a query like

    Element printingCompanyDT = (Element)book.query(".//PrintingCompanyDT").get(0);
    

    (// being the XPath notation for finding descendants at any level, as opposed to / which looks only for direct children).
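
If adding XOM as a dependency is not an option, roughly the same result can be had with plain SAX by re-serializing the events yourself while the parser is inside the element of interest, instead of only appending in characters(). The following is a minimal sketch of that idea, not a full serializer: the class name SaxFragmentExtractor and the helper extract are made up for illustration, it assumes a non-namespace-aware parser (the JDK default, so qName carries the element name), and it ignores comments, processing instructions and CDATA boundaries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every <target> element as an XML string.
class SaxFragmentExtractor extends DefaultHandler {
    final List<String> fragments = new ArrayList<>();
    private final String target;
    private StringBuilder buf;   // non-null only while inside a target element
    private int depth;           // nesting depth relative to the target element

    SaxFragmentExtractor(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (buf == null && qName.equals(target)) {
            buf = new StringBuilder();   // start capturing at the target's opening tag
            depth = 0;
        }
        if (buf != null) {
            depth++;
            buf.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                buf.append(' ').append(atts.getQName(i))
                   .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            buf.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node; appending handles that.
        if (buf != null) {
            buf.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buf == null) {
            return;
        }
        buf.append("</").append(qName).append('>');
        if (--depth == 0) {              // closing tag of the target itself
            fragments.add(buf.toString());
            buf = null;
        }
    }

    // The parser hands over unescaped text, so escape it again on the way out.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Convenience wrapper for small inputs held in a String.
    public static List<String> extract(String xml, String target) throws Exception {
        SaxFragmentExtractor handler = new SaxFragmentExtractor(target);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fragments;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Library><Book name=\"ABC\"><Author>John</Author>"
                + "<PrintingCompanyDT><Printer>Sam</Printer></PrintingCompanyDT></Book></Library>";
        for (String fragment : extract(xml, "PrintingCompanyDT")) {
            System.out.println(fragment);
        }
    }
}
```

For the 500 MB files you would pass a File or InputStream to SAXParser.parse rather than a String, and handle each fragment as soon as its closing tag arrives instead of collecting them all in a list, so memory use stays bounded by the size of one PrintingCompanyDT element.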