Way to capture part of the XML source with SAXParser


I need to capture the text inside the <page> tags of my XML file: the whole text, including nested tags, their attributes, and so on. I could do this with regular expressions, for example, but I need it to be safe, so I would like to use SAXParser.

But I'm afraid that the information a ContentHandler receives from SAXParser isn't enough to do this (the cursor position at the start of each XML tag, for example, would help a lot).

So, is there another safe way?

Instead of the text within <page>, the result could be, for example, a DOM tree, but I would prefer the first option for performance.


Solution

  • Okay, the first thing I would do is create a custom DefaultHandler, something like the following:

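    import java.util.ArrayList;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class PrintXMLwithSAX extends DefaultHandler {

      // Nesting depth of <page> elements; -1 means we are outside any <page>.
      private int embedded = -1;
      private StringBuilder sb = new StringBuilder();
      private final ArrayList<String> pages = new ArrayList<String>();

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
          if (qName.equals("page")) {
              embedded++;
          }
          // Inside a <page>: re-emit the opening tag (attributes are not copied here).
          if (embedded >= 0) sb.append("<" + qName + ">");
      }

      @Override
      public void characters(char[] ch, int start, int length) throws SAXException {
          if (embedded >= 0) sb.append(ch, start, length);
      }

      @Override
      public void endElement(String uri, String localName, String qName) throws SAXException {
          if (embedded >= 0) sb.append("</" + qName + ">");
          // Only a closing </page> can finish a capture. Testing embedded == -1 on
          // every end tag would add an empty string for each element outside <page>.
          if (qName.equals("page")) {
              embedded--;
              if (embedded == -1) {
                  pages.add(sb.toString());
                  sb = new StringBuilder();
              }
          }
      }

      public ArrayList<String> getPages() {
          return pages;
      }

    }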
    

    As the parser reads the document, it calls the handler's startElement(), characters() and endElement() methods (and a few others) for each element it encounters. The code above checks in startElement() whether the element is a <page> element and, if so, increments embedded by 1. Each method then checks whether embedded is >= 0. If it is, we are inside a <page>, so the method appends the element's tags (excluding attributes in this particular example) and the characters inside them to the StringBuilder. endElement() decrements embedded when it reaches a closing </page> tag. When that brings embedded back down to -1, we know the outermost <page> has been closed, so we add the contents of the StringBuilder to the ArrayList pages and start a fresh StringBuilder to await the next <page> element.

    Then run the handler and retrieve the ArrayList of strings containing your <page> elements, like so:

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        PrintXMLwithSAX handler = new PrintXMLwithSAX();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream input = new FileInputStream("C:\\Users\\me\\Desktop\\xml.xml")) {
            saxParser.parse(input, handler);
        }
        ArrayList<String> myPageElements = handler.getPages();
    

    Now myPageElements is an ArrayList containing all page elements and their contents as strings.
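
    Since the question asks for attributes as well, here is a rough sketch of how the same idea could be extended to copy them. It also re-escapes text, because SAX hands characters() the already-unescaped content (so &amp; arrives as &). The class name PageCaptureHandler and the helpers escape() and capture() are made up for this illustration, and it assumes the default, non-namespace-aware parser so that qName holds the raw tag name. The output is a re-serialization rather than a byte-for-byte copy of the source: empty tags come out as an open/close pair, and comments and CDATA markers are not reproduced.

    ```java
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical variant of PrintXMLwithSAX that also re-emits attributes
    // and re-escapes text, so each captured string is well-formed XML again.
    class PageCaptureHandler extends DefaultHandler {

        private int depth = 0;                    // number of currently open <page> elements
        private StringBuilder sb = new StringBuilder();
        private final List<String> pages = new ArrayList<>();

        // Re-escape the characters that the parser has already unescaped.
        static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("page")) depth++;
            if (depth == 0) return;               // not inside a <page>
            sb.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i))
                  .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            sb.append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) sb.append(escape(new String(ch, start, length)));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == 0) return;               // not inside a <page>
            sb.append("</").append(qName).append('>');
            if (qName.equals("page") && --depth == 0) {
                pages.add(sb.toString());         // outermost <page> just closed
                sb = new StringBuilder();
            }
        }

        List<String> getPages() {
            return pages;
        }

        // Convenience helper for trying the handler on an in-memory document.
        static List<String> capture(String xml) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PageCaptureHandler handler = new PageCaptureHandler();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.getPages();
        }
    }
    ```

    Calling PageCaptureHandler.capture(xmlString) returns one string per outermost <page> element; a nested <page> stays inside its parent's string instead of producing a separate entry.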

    I hope this helps.