I need to capture the text within the <page> tags of my XML file: the whole text, including nested tags, their attributes, and so on. I could do this with, for example, regular expressions, but I need it to be safe, so I would like to use SAXParser.
But I'm afraid that the information a ContentHandler receives from SAXParser isn't enough to do this (the cursor position at the start of each XML tag, for example, would help a lot).
So, is there any other, safe way?
Instead of the text within <page>, the result could be, for example, a DOM tree, but I would prefer the first approach for performance.
Okay, what I would do first is create a custom DefaultHandler, something like the following:
import java.util.ArrayList;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PrintXMLwithSAX extends DefaultHandler {
    // -1 means "outside any <page>"; 0 or more means "inside a <page>".
    private int embedded = -1;
    private StringBuilder sb = new StringBuilder();
    private final ArrayList<String> pages = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equals("page")) {
            embedded++;
        }
        // Attributes are not copied in this example.
        if (embedded >= 0) sb.append("<" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (embedded >= 0) sb.append(new String(ch, start, length));
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (embedded >= 0) sb.append("</" + qName + ">");
        if (qName.equals("page")) {
            embedded--;
            // Store the page only when the outermost </page> closes, so that
            // elements outside <page> never add empty entries.
            if (embedded == -1) {
                pages.add(sb.toString());
                sb = new StringBuilder();
            }
        }
    }

    public ArrayList<String> getPages() {
        return pages;
    }
}
During parsing, the parser walks through each element and calls the handler's startElement(), characters(), endElement() and a few other methods. The code above checks in startElement() whether the element is a <page> element; if so, it increments embedded by 1. After that, each method checks whether embedded is >= 0. If it is, it appends the characters inside each element, as well as their tags (excluding attributes in this particular example), to the StringBuilder object. endElement() decrements embedded when it reaches a closing </page> tag. If embedded falls back down to -1, we know that we are no longer inside a page element, so we add the contents of the StringBuilder to the pages ArrayList and start a fresh StringBuilder to await the next <page> element.
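Since the question asks for attributes as well, here is a rough sketch of how the same idea could be extended to copy them. Keep in mind that characters() and the attribute values arrive already decoded, so the sketch re-escapes them before writing them back out. The class name PageCaptureHandler, the escape() helper and the capture() convenience method are made up for this illustration, and the output is a re-serialization, not a byte-for-byte copy of the input (comments, CDATA markers and the original quoting are not preserved).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the handler: keeps attributes and re-escapes text.
class PageCaptureHandler extends DefaultHandler {
    private int depth = 0;                         // number of <page> elements currently open
    private StringBuilder sb = new StringBuilder();
    private final List<String> pages = new ArrayList<>();

    // Re-escape characters that the parser has already decoded.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("page")) depth++;
        if (depth > 0) {
            sb.append('<').append(qName);
            // Write each attribute back as name="value".
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i))
                  .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            sb.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) sb.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) sb.append("</").append(qName).append('>');
        // Store the page once the outermost </page> has been closed.
        if (qName.equals("page") && --depth == 0) {
            pages.add(sb.toString());
            sb = new StringBuilder();
        }
    }

    List<String> getPages() { return pages; }

    // Hypothetical convenience helper: parse an in-memory XML string.
    static List<String> capture(String xml) {
        try {
            PageCaptureHandler handler = new PageCaptureHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

This relies on the default, non-namespace-aware SAXParserFactory, where qName and getQName() hold the names as written in the document. With a namespace-aware parser you would probably also need to handle prefix declarations.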
Then you'll need to run the handler and retrieve the ArrayList of strings containing your <page> elements, like so:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
PrintXMLwithSAX handler = new PrintXMLwithSAX();
// try-with-resources closes the stream even if parsing fails
try (InputStream input = new FileInputStream("C:\\Users\\me\\Desktop\\xml.xml")) {
    saxParser.parse(input, handler);
}
ArrayList<String> myPageElements = handler.getPages();
Now myPageElements is an ArrayList containing all page elements and their contents as strings.
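On the cursor position mentioned in the question: SAX can pass the handler a Locator through setDocumentLocator(), and it reports a line and column number for the current event. The sketch below only records the line on which each <page> start tag is reported. The names PagePositionHandler and pageLines() are invented for this example. Two caveats: a parser is not required to supply a Locator at all, and the reported position is where the event's text ends (just after the start tag), not where the tag begins, so it is probably not enough on its own to slice the raw text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the line reported for each <page> start tag.
class PagePositionHandler extends DefaultHandler {
    private Locator locator;                       // stays null if the parser supplies none
    private final List<Integer> lines = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument(), if it supports locators.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null && qName.equals("page")) {
            lines.add(locator.getLineNumber());    // 1-based line where the event ends
        }
    }

    List<Integer> getLines() { return lines; }

    // Hypothetical convenience helper: parse an in-memory XML string.
    static List<Integer> pageLines(String xml) {
        try {
            PagePositionHandler handler = new PagePositionHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getLines();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```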
I hope this helps.