
Building XML file with SAX parser

I am parsing an XML Wikipedia data dump, and I'd like to pull out each page and write it to a new XML document as a stripped-down version of the page. Of each page, I'm only interested in the title, id, timestamp, username, and text.

Here is a full Wikipedia page:

<redirect title="Computer accessibility" />
  <minor />
  <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/|]] ([[User talk:|talk]]) to last version by Gurch</comment>
  <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  <sha1 />

What I'd like to end up with after the stripping is done would be something like this:

    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>

Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?



  • You can use XMLFilterImpl and pass through only the content you need. Here is the idea. Both input and output are streams, so it can process XML of any size.

        XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (qName.equals("page")) {
                    super.startElement(uri, localName, qName, atts);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (qName.equals("page")) {
                    super.endElement(uri, localName, qName);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                // All text is dropped here; uncomment the next line to pass it through.
                //super.characters(ch, start, length);
            }
        };
        Source src = new SAXSource(xr, new InputSource("1.xml"));
        Result res = new StreamResult(System.out);
        TransformerFactory.newInstance().newTransformer().transform(src, res);
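To keep the fields listed in the question rather than only the empty page element, the same filter idea can be extended with a whitelist. The following is a minimal sketch, not a definitive implementation. The names PageStripper, KeepFilter, LEAVES, and strip are made up for illustration, and the element names are assumed from the question's list (title, id, timestamp, username, text). It also keeps the document root so the output has a single root element, which is a design choice of this sketch. Because it matches on element name alone, every id element is kept no matter where it is nested, and dropped wrapper elements are flattened away.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class PageStripper {
    // Assumed whitelist: the fields named in the question. Their text is copied to the output.
    static final Set<String> LEAVES = new HashSet<>(
            Arrays.asList("title", "id", "timestamp", "username", "text"));

    /** Forwards the root, page elements, and whitelisted leaves; drops everything else. */
    static class KeepFilter extends XMLFilterImpl {
        // Names of the currently open elements, innermost first.
        private final Deque<String> open = new ArrayDeque<>();

        KeepFilter(XMLReader parent) {
            super(parent);
        }

        private boolean keep(String qName, int depth) {
            // Depth 0 is the document root, kept so the output has a single root.
            return depth == 0 || qName.equals("page") || LEAVES.contains(qName);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (keep(qName, open.size())) {
                super.startElement(uri, localName, qName, atts);
            }
            open.push(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            open.pop();
            if (keep(qName, open.size())) {
                super.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Copy text only when the innermost open element is a whitelisted leaf.
            if (!open.isEmpty() && LEAVES.contains(open.peek())) {
                super.characters(ch, start, length);
            }
        }
    }

    /** Streams XML from in to out through the filter, using an identity transform. */
    static void strip(Reader in, Writer out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader filter = new KeepFilter(factory.newSAXParser().getXMLReader());
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        identity.transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }
}
```

For a real dump, strip could be called with a buffered FileReader and FileWriter. Nothing is held in memory beyond the stack of open element names, so memory use does not grow with the size of the input. If only the page-level id is wanted, the filter would need to check the parent element as well as the name.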