I am parsing a Wikipedia XML data dump and would like to extract each page into a new XML document containing a stripped-down version of that page. For each page, I'm only interested in the title, id, timestamp, username, and text.
Here is a full Wikipedia page:
<redirect title="Computer accessibility" />
<minor />
<comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/|]] ([[User talk:|talk]]) to last version by Gurch</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
<sha1 />
What I'd like to end up with after the stripping is done would be something like this:
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?
You can use XMLFilterImpl to pass through only the content you need. Here is the idea; both input and output are streams, so it can process XML of any size:
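// Requires imports from org.xml.sax, org.xml.sax.helpers, javax.xml.transform, javax.xml.transform.sax and javax.xml.transform.stream.
XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Forward only the start tags you want to keep; extend this check for other elements.
        if (qName.equals("page")) {
            super.startElement(uri, localName, qName, atts);
        }
    }
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Must mirror the check in startElement so every forwarded tag is closed.
        if (qName.equals("page")) {
            super.endElement(uri, localName, qName);
        }
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Text is dropped; call super.characters(ch, start, length) here for the elements you want to keep.
    }
};
// The identity transformer pulls SAX events through the filter and serializes them.
Source src = new SAXSource(xr, new InputSource("1.xml"));
Result res = new StreamResult(System.out);
TransformerFactory.newInstance().newTransformer().transform(src, res);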
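Building on the filter idea above, here is a rough sketch of how it might be extended to keep just the fields named in the question. Several things here are assumptions rather than anything from the original post: the class name `PageStripper` and its helper methods are hypothetical, the `CONTAINERS` set is a guess at the wrapper elements in the dump, and matching on `qName` assumes the element names are unprefixed. It obtains the reader from `SAXParserFactory` because `XMLReaderFactory` is deprecated since Java 9.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: keeps a whitelist of elements and drops everything else. */
class PageStripper extends XMLFilterImpl {
    // Leaf elements named in the question; note that every <id> is kept, not only the page id.
    private static final Set<String> KEEP =
            Set.of("title", "id", "timestamp", "username", "text");
    // Assumed wrapper elements, forwarded so kept leaves stay nested under a single root.
    private static final Set<String> CONTAINERS =
            Set.of("mediawiki", "page", "revision", "contributor");

    private int keepDepth; // > 0 while inside a kept leaf element
    private int skipDepth; // > 0 while inside a dropped subtree

    PageStripper(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth++;                       // nested inside something already dropped
        } else if (keepDepth > 0 || KEEP.contains(qName)) {
            keepDepth++;
            super.startElement(uri, localName, qName, atts);
        } else if (CONTAINERS.contains(qName)) {
            super.startElement(uri, localName, qName, atts);
        } else {
            skipDepth = 1;                     // start dropping this subtree
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;                       // closing tag of a dropped element
            return;
        }
        if (keepDepth > 0) {
            keepDepth--;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (keepDepth > 0) {                   // text outside kept leaves is discarded
            super.characters(ch, start, length);
        }
    }

    /** Streams {@code in} through the filter into {@code out}. */
    static void strip(InputSource in, Result out) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        identity.transform(new SAXSource(new PageStripper(parser), in), out);
    }

    /** Convenience wrapper for small in-memory inputs; large dumps should use strip(). */
    static String stripToString(String xml) {
        StringWriter sw = new StringWriter();
        try {
            strip(new InputSource(new StringReader(xml)), new StreamResult(sw));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // Tiny made-up sample, not real dump content.
        String sample = "<page><title>T</title><id>1</id><revision><minor/>"
                + "<comment>c</comment><text>body</text></revision></page>";
        System.out.println(stripToString(sample));
    }
}
```

For a real dump you would presumably call `PageStripper.strip(new InputSource("1.xml"), new StreamResult(System.out))` so that neither input nor output is held in memory. The two depth counters are one way to track position without a stack; if you need only the page-level `id`, you would have to track the parent element as well.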