Building XML file with SAX parser


I am parsing an XML Wikipedia data dump, and I'd like to pull out each page and write it to a new XML document as a stripped-down version of the page. For each page, I'm only interested in the title, id, timestamp, username, and text.

Here is a full Wikipedia page:

<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
  <id>381202555</id>
  <timestamp>2010-08-26T22:38:36Z</timestamp>
  <contributor>
    <username>OlEnglish</username>
    <id>7181920</id>
  </contributor>
  <minor />
  <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
  <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  <sha1 />
  </revision>
</page>

What I'd like to end up with after the stripping is done would be something like this:

<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <revision>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
    </contributor>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>

Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?

Thanks


Solution

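  • You can use XMLFilterImpl and forward only the content you need. Here is the idea: both input and output are streams, so it can process XML of any size. The filter below keeps only the page element and drops all character data; extend the conditions to keep the other elements you want.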

        // Filter that forwards only <page> start/end tags and drops everything else.
        // Note: XMLReaderFactory is deprecated since Java 9; SAXParserFactory is the usual replacement.
        XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (qName.equals("page")) {
                    super.startElement(uri, localName, qName, atts);
                }
            }
    
            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (qName.equals("page")) {
                    super.endElement(uri, localName, qName);
                }
            }
    
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                // Not forwarded: call super.characters(ch, start, length) to keep text content.
            }
        };
        // The identity transformer pulls SAX events through the filter and serializes them.
        Source src = new SAXSource(xr, new InputSource("1.xml"));
        Result res = new StreamResult(System.out);
        TransformerFactory.newInstance().newTransformer().transform(src, res);
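
    To get the exact output from the question, the filter has to tell the page-level id apart from the revision and contributor ids, so matching on the element name alone is not enough. One way is to track the path of each open element relative to its page and forward only whitelisted paths. The following is a minimal sketch of that idea, not a tested solution for full dumps: the class name PageStripper, the strip helper, and the two path sets are made up for illustration, and it assumes a parser that is not namespace-aware, so element names arrive in qName.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: forwards only whitelisted element paths of each page. */
public class PageStripper extends XMLFilterImpl {
    // Containers are forwarded without their own text (drops leftover whitespace).
    private static final Set<String> CONTAINERS = new HashSet<>(Arrays.asList(
            "page", "page/revision", "page/revision/contributor"));
    // Leaves are forwarded together with their character data.
    private static final Set<String> LEAVES = new HashSet<>(Arrays.asList(
            "page/title", "page/id", "page/revision/timestamp",
            "page/revision/contributor/username", "page/revision/text"));

    // One entry per open element: its path, restarted at every <page>.
    private final Deque<String> paths = new ArrayDeque<>();

    public PageStripper(XMLReader parent) {
        super(parent);
    }

    private static boolean kept(String path, int depth) {
        // Depth 1 is the document root; it is always kept so that a wrapper
        // element around many pages still yields a well-formed document.
        return depth == 1 || CONTAINERS.contains(path) || LEAVES.contains(path);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String path = (qName.equals("page") || paths.isEmpty())
                ? qName
                : paths.peek() + "/" + qName;
        paths.push(path);
        if (kept(path, paths.size())) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        boolean keep = kept(paths.peek(), paths.size());
        paths.pop();
        if (keep) {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Forward text only when the innermost open element is a whitelisted leaf.
        if (LEAVES.contains(paths.peek())) {
            super.characters(ch, start, length);
        }
    }

    /** Streams the input through the filter and serializes what is left. */
    public static void strip(InputSource in, Result out) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        identity.transform(new SAXSource(new PageStripper(parser), in), out);
    }
}
```

    It could be called as PageStripper.strip(new InputSource("dump.xml"), new StreamResult(new File("stripped.xml"))), where both file names are placeholders. Because whitespace between elements is dropped along with the unwanted elements, the output is not indented; setting OutputKeys.INDENT to "yes" on the transformer is one way to get line breaks back.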