
Which Java XML Parsing method to use when rewriting an XML file?


Edited for a little clarity.

I'm writing a Java application that takes an XML file and rewrites it if the information in the file needs to be updated. An example of an XML file is below:

<!DOCTYPE book PUBLIC "myDTD.dtd" [

<!ENTITY % ent SYSTEM "entities.ent">
%ent;

]>

<book id="EXDOC" label="beta" lang="en">
   <title>Example Document</title>
   <bookinfo>
      <authorgroup>
         <author>
            <firstname>George</firstname>
            <surname>Washington</surname>
         </author>
         <author>
            <firstname>Barbara</firstname>
            <surname>Bush</surname>
         </author>
      </authorgroup>
      <pubsnumber>E12345</pubsnumber>
      <releaseinfo/>
      <pubdate>March 2016</pubdate>
      <copyright>
         <year>2012, 2016</year>
         <holder>Company and/or its affiliates. All rights reserved.</holder>
      </copyright>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="Abstract.xml" parse="xml"/>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="legal.xml" parse="xml"/>
   </bookinfo>
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="preface.xml" parse="xml"/>
...

I need to grab certain nodes and check that information, and if the information is incorrect, update the node to have the correct text. I might also want to add/remove nodes as needed.

For example, in the <copyright> node, I might need to change the copyright year to list the most recent year. Or I might need to add a writer to the <authorgroup> element.
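To make the two kinds of update concrete, here is a minimal sketch that uses the standard DOM and javax.xml.xpath APIs. The class and method names (BookInfoUpdater, updateCopyrightYear, addAuthor) are hypothetical, the XPath expressions are read off the example document above, and DTD and entity handling is deliberately left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class BookInfoUpdater {

    /** Parses an XML string into a DOM Document (no DTD or entities in this sketch). */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the first element matching the XPath expression, or null if none matches. */
    private static Element find(Document doc, String expression) {
        try {
            return (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    /** Sets the copyright year if it differs from the expected value; returns true if it changed. */
    public static boolean updateCopyrightYear(Document doc, String expected) {
        Element year = find(doc, "/book/bookinfo/copyright/year");
        if (year == null || expected.equals(year.getTextContent())) {
            return false;
        }
        year.setTextContent(expected);
        return true;
    }

    /** Appends a new author element with the given names to the authorgroup. */
    public static void addAuthor(Document doc, String first, String last) {
        Element group = find(doc, "/book/bookinfo/authorgroup");
        if (group == null) {
            throw new IllegalStateException("No authorgroup element found");
        }
        Element author = doc.createElement("author");
        Element firstname = doc.createElement("firstname");
        firstname.setTextContent(first);
        Element surname = doc.createElement("surname");
        surname.setTextContent(last);
        author.appendChild(firstname);
        author.appendChild(surname);
        group.appendChild(author);
    }
}
```

Checked exceptions are wrapped in unchecked ones only to keep the sketch short; real code would probably propagate or log them.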

At the moment, I create a SAX parser instance, validate the XML file and build a Document from it (which in turn resolves any entities), read the nodes from the Document, and update their text with the setTextContent() method. Once all updates for a particular file are done, I write the resulting Document back out using a DOMSource and a Transformer:

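 // Needs java.io.File, javax.xml.transform.Transformer, javax.xml.transform.TransformerFactory,
 // javax.xml.transform.dom.DOMSource and javax.xml.transform.stream.StreamResult
 TransformerFactory transformerFactory = TransformerFactory.newInstance();
 Transformer transformer = transformerFactory.newTransformer();
 // doc is the updated Document; uri points at the file being overwritten
 DOMSource source = new DOMSource(doc);
 StreamResult result = new StreamResult(new File(uri));
 transformer.transform(source, result);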

This presents some limitations, though, that I really want to get around. For one, if the inline text contains an entity reference such as &something;, I want to keep the reference as is. At the moment, the entity is replaced by its text when the file is rewritten.

So for example, if I have

<!ENTITY something "Something">

if my file has something like:

<para> There's a &something; here.</para>

When I rewrite, I want it to say:

<para> Here's a &something; there.</para>

But the entity resolves and the file becomes:

<para>Here's a Something there.</para>
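The behaviour can be reproduced in isolation. The sketch below is a minimal reproduction, assuming a default DocumentBuilderFactory and an identity Transformer; the class name EntityExpansionDemo and the inline DOCTYPE are made up for illustration and stand in for the real DTD and entities.ent file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityExpansionDemo {

    /** Parses the XML with default settings and serializes it again without editing it. */
    public static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // The entity is declared in an internal subset instead of an external file.
        String xml = "<!DOCTYPE para [<!ENTITY something \"Something\">]>"
                + "<para>There's a &something; here.</para>";
        // The reference is already replaced by its text in the serialized output.
        System.out.println(roundTrip(xml));
    }
}
```

Even with no edits in between, the reference does not survive: the parser has substituted the entity's replacement text before the Transformer ever sees the document.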

I'm not sure how to change my entityResolver class so that it doesn't automatically resolve these entities when I read the nodes, without breaking the rest of my code. I also have another class that uses XPath to pull certain information from the document and compare it with what is recorded in the database. I can't simply leave the entityResolver unset, because then that XPath expression breaks entirely.

I suppose I could use one parser for reading and writing the XML file and keep the SAX parser that's needed to compare against our database, but I want to do this as cleanly as possible.

Any help would be appreciated...


Solution

  • Unfortunately, you cannot tell the transformation engine not to expand the entity references. That happens as the XML is parsed, so they are lost by the time the XML content is being transformed.

    What about a multi-stage transformation scenario where you:

    1. Replace each entity reference with an entity-reference-like token, e.g. replace &something; with ¶something;, as Michael Kay suggested.
    2. Perform your transformation to adjust the content as needed. The parser has no entity references left to expand, so the entity-reference-like tokens pass through untouched. If you do need the entities resolved in order to verify their content, you could also load the original XML document (with expanded entities) and cross-reference between the two documents.

    3. Change the entity-reference-like tokens in the transformed output back into entity-references with another find/replace.
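
The three steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the class EntityMasker and its methods are hypothetical names, the ¶ character (written as \u00B6 in the code) is assumed not to occur anywhere else in the source files, the five predefined XML entities are left alone, and preserving the DOCTYPE and its internal subset on output is not handled at all.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Consumer;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityMasker {

    // Named references such as &something;, skipping the five predefined XML entities.
    private static final Pattern ENTITY_REF =
            Pattern.compile("&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][A-Za-z0-9._-]*);");

    // The placeholder form: a pilcrow (\u00B6) followed by the entity name and a semicolon.
    private static final Pattern TOKEN =
            Pattern.compile("\u00B6([A-Za-z_][A-Za-z0-9._-]*);");

    /** Step 1: turn entity references into inert tokens before parsing. */
    public static String mask(String xml) {
        return ENTITY_REF.matcher(xml).replaceAll("\u00B6$1;");
    }

    /** Step 3: turn the tokens back into entity references after serializing. */
    public static String unmask(String xml) {
        return TOKEN.matcher(xml).replaceAll("&$1;");
    }

    /** Steps 1 to 3 together: mask, parse, apply the caller's DOM edits, serialize, unmask. */
    public static String rewrite(String xml, Consumer<Document> edits) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(mask(xml))));
            edits.accept(doc); // step 2: whatever updates the file needs
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return unmask(out.toString());
        } catch (Exception e) {
            throw new IllegalStateException("Rewrite failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<para>There's a &something; here.</para>";
        // Example edit: add an attribute to the root element; the reference is kept.
        System.out.println(rewrite(xml, doc -> doc.getDocumentElement().setAttribute("role", "note")));
    }
}
```

Because the masking is a plain text substitution, it also touches references inside comments and CDATA sections. Those are restored by the same unmask step, but any check that reads the text in step 2 will see the token rather than the entity's value.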