java xml xml-parsing xmlcatalog

Java catalog resolver for multiple catalog files (as well as catalog.xml) and DTDs (as well as XSDs) and pointer to more info than just the API?


I work with PTC Arbortext Editor, which was originally written in the pre-XML (SGML) days of the late 1980s. A Java application uses org.custommonkey.xmlunit to diff XML files.

The diff tool fails to parse files that rely on catalogs. On Windows, the files expect a semicolon-separated list of absolute paths to catalog locations, and each location is searched for catalog and/or catalog.xml files. These catalogs may use the CATALOG directive, and they map PUBLIC identifiers to paths that are relative to the particular catalog file.

I am parsing XML using this catalog info which may contain file entities as well as XML inclusions.

For some use cases, I can set validating to false and that works (it is reasonable to assume the two files are valid), but for some files I have to read the catalog info to resolve file entities in the XML.

I can ask the user to provide a list of absolute paths to their top-level catalog locations. However, I am rather lost when it comes to selecting a resolver and integrating it into my code.
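To make the "list of catalog locations" concrete, here is a minimal sketch of how such a list could be expanded into actual catalog files. The class and method names (CatalogLocator, findCatalogFiles) are hypothetical, and it assumes the list is always semicolon-separated and that each entry is a directory holding a file named catalog and/or catalog.xml.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any resolver library.
class CatalogLocator {

    /**
     * Splits a semicolon-separated list of directories and returns every
     * "catalog" or "catalog.xml" file that exists directly inside them,
     * in the order the directories were listed.
     */
    static List<Path> findCatalogFiles(String catalogPath) {
        List<Path> found = new ArrayList<>();
        for (String entry : catalogPath.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries such as a trailing ';'
            }
            Path dir = Paths.get(trimmed);
            for (String name : new String[] {"catalog", "catalog.xml"}) {
                Path candidate = dir.resolve(name);
                if (Files.isRegularFile(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Usage: java CatalogLocator "C:\doctypes;C:\custom\doctypes"
        String list = args.length > 0 ? args[0] : "";
        findCatalogFiles(list).forEach(System.out::println);
    }
}
```

The resulting file paths are what a catalog resolver would then need to be told about; the directories themselves are not enough for most resolvers.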

I am using Java 1.8 but don't mind going to 10 if that would help/simplify. It looks like 9 had some simple support with javax.xml.catalog, but that doesn't seem to be in 1.8 or 10.

I can provide my parsing code if that matters, but I'm not stuck on any one parser.

My code is below. I switched from LSParser to DocumentBuilder for the sake of setValidating(false).

Here are a couple excerpts from one of the files I'd like to be able to work with:

<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2016, v.4002-->
<!DOCTYPE Composer PUBLIC "-//Arbortext//DTD Composer 1.0//EN"
 "../doctypes/composer/composer.dtd" [
<!ENTITY % stock PUBLIC "-//Arbortext//DTD Fragment - ATI Stock filter list//EN" "../composer/stock.ent">
%stock;
]>
<?Pub Inc?>
<Composer>
<Label>Compose to PDF</Label>
 . . . 
<Resource>
<Label></Label>
<Documentation></Documentation>&epicGenerator;
&fileSerializer;
&serverProfiler;
&clientProfiler;
&xslTransformer;
&epicSerializer;
&switch;
&errorHandler;
&namespaceFixer;
&atiEventConverter;
&foPropagator;
&extensionHandler;
&ditaPostProcessor;
&ditaStyledElementsTranslator;
&atictFilter;
&applicabilityFilter;
</Resource>

And here are a few lines from one of the catalog files I need to reference:

PUBLIC "-//Arbortext//ENTITIES SAX Event Upstream Loop//EN" "upstreamLoop.ent"
PUBLIC "-//Arbortext//ENTITIES keyRef Resolver//EN" "keyRefResolver.ent"
PUBLIC "-//Arbortext//ENTITIES ATI Change Tracking Filter 1.0//EN" "atictFilter.ent"
PUBLIC "-//Arbortext//ENTITIES Font Filter 1.0//EN" "fontFilter.ent"
PUBLIC "-//Arbortext//ENTITIES Simple Attribute Cascader//EN" "simpleAttrCascader.ent"
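To illustrate what "relative to the particular catalog file" means for entries like these, here is a rough sketch that maps each PUBLIC identifier to a path resolved against the catalog's own directory. It is only an illustration with hypothetical names (PublicEntryReader, readPublicEntries). It assumes one double-quoted entry per line and ignores everything else a real catalog may contain (CATALOG directives, comments, single quotes, multi-line entries), so it is not a substitute for a real resolver.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; handles only simple one-line PUBLIC entries.
class PublicEntryReader {

    // Matches: PUBLIC "public identifier" "system identifier"
    private static final Pattern PUBLIC_ENTRY =
            Pattern.compile("^\\s*PUBLIC\\s+\"([^\"]*)\"\\s+\"([^\"]*)\"\\s*$");

    /**
     * Maps each PUBLIC identifier to its system identifier, resolved against
     * the directory that contains the catalog file. Other lines are skipped.
     */
    static Map<String, Path> readPublicEntries(Path catalogFile, List<String> lines) {
        Map<String, Path> entries = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = PUBLIC_ENTRY.matcher(line);
            if (m.matches()) {
                Path target = catalogFile.resolveSibling(m.group(2)).normalize();
                entries.put(m.group(1), target);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // The catalog location here is made up for the demo.
        Path catalog = Paths.get("doctypes", "composer", "catalog");
        List<String> lines = Arrays.asList(
                "PUBLIC \"-//Arbortext//ENTITIES Font Filter 1.0//EN\" \"fontFilter.ent\"");
        readPublicEntries(catalog, lines)
                .forEach((publicId, path) -> System.out.println(publicId + " -> " + path));
    }
}
```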

Resources

StackOverflow

I also looked at Validate XML using XSD, a Catalog Resolver, and JAXP DOM for XSLT. I feel like it is unlikely to solve my problem, but could be wrong.

Test Case

I have uploaded Java code, directory structure, and XML to http://aapro.net/CatalogTest.zip

It should be possible to add something to my program which accepts a path to the Test/doctypes folder (the folder, not the catalog file therein), and then the CatalogTest.xml file should parse successfully with the "Validate" option the program prompts for. Other (expensive) SGML/XML-aware software can do so. The catalog resolver, once given the absolute path to the Test/doctypes folder, should be able to follow the CATALOG directive in the Test/doctypes/catalog file to the Test/other/forms/catalog file, to the Test/other/forms/forms.dtd. The parser should be able to parse Test/other/forms/forms.dtd and use it to validate Test/CatalogTest.xml.

Really, this whole process should be able to handle such catalog files OR catalog.xml files, and should be able to parse DTD or XSD files, and SGML or XML instances. But I don't actually care about SGML too much; there are only a few milspec situations still around that use that in my working environment.

Multiple methods?

I'd be willing to try more than one resolver and/or parser, or let the user make the selection.

Java Code Inline

(Also in the aforementioned zip file)

import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.JOptionPane;
import javax.swing.filechooser.FileNameExtensionFilter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseXmlWithCatalog {

        public static void main(String[] args) {
                int validating = JOptionPane.showOptionDialog(null, "Do you want validation?", "Please choose \"Yes\" for validation",
                                JOptionPane.YES_NO_OPTION, JOptionPane.QUESTION_MESSAGE, null, null, JOptionPane.YES_OPTION);
                parseDoc(getFile(args), validating == JOptionPane.YES_OPTION);
        }

        private static boolean parseDoc(File inFile, boolean validate) {
                if (inFile == null) {
                        JOptionPane.showMessageDialog(null, "Failure opening input XML.");
                        return false; // nothing to parse; avoids a NullPointerException below
                }
                try {
                        /*
                        System.setProperty(DOMImplementationRegistry.PROPERTY, "org.apache.xerces.dom.DOMImplementationSourceImpl");
                        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
                        DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
                        LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
                        LSParserFilter filter = new InputFilter();
                        builder.setFilter(filter);
                */
                        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
                        if (!validate) {
                                builderFactory.setValidating(false);
                                builderFactory.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
                        }
                        DocumentBuilder builder = builderFactory.newDocumentBuilder();

                        // parse(String) expects a URI, so pass the File itself rather than a raw path
                        Document testDoc = builder.parse(inFile);
                        System.out.println(testDoc.getFirstChild().getNodeName());
                } catch (Exception exc) {
                        JOptionPane.showMessageDialog(null, "Failure parsing input XML: " + exc.getMessage());
                        return false;
                }
                return true;
        }

        public static File getFile(String[] args) {
                if (args.length > 1) {
                        JOptionPane.showMessageDialog(null, "Too many arguments.");
                        return null;
                }
                if (args.length == 1) {
                        return new File(args[0]);
                }
                JFileChooser fileChooser = new JFileChooser();
                fileChooser.setMultiSelectionEnabled(false);
                fileChooser.setDialogTitle("Select 1 XML file");
                FileNameExtensionFilter filter = new FileNameExtensionFilter("XML Files", "xml", "ditamap", "dita", "style");
                fileChooser.setFileFilter(filter);
                int response = fileChooser.showOpenDialog(null);
                if (response != JFileChooser.APPROVE_OPTION) {
                        // aborted
                        return null;
                }
                return fileChooser.getSelectedFile();

        }

}

Solution

  • The Apache XML Commons Resolver supports both OASIS XML Catalogs and the older OASIS TR9401 Catalogs format. See https://xerces.apache.org/xml-commons/components/resolver/.

    To enable catalog lookup in your test project, do as follows:

    1. Download XML Commons Resolver from http://xerces.apache.org/mirrors.cgi#binary.

    2. Extract resolver.jar and add it to your classpath.

    3. Create a text file called CatalogManager.properties and put it on your classpath. In this file, add the path to the catalog(s):

      catalogs=./doctypes/catalog
      

      The locations of catalog files can also be specified via the xml.catalog.files Java system property.

    4. In ParseXmlWithCatalog.java, add an import statement and create an instance of CatalogResolver. Set that instance as the parser's EntityResolver before calling builder.parse:

      import org.apache.xml.resolver.tools.CatalogResolver;
      ...
      
      CatalogResolver cr = new CatalogResolver();
      builder.setEntityResolver(cr);
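Putting the steps together, the following is a sketch rather than a tested drop-in. It takes the resolver as a plain org.xml.sax.EntityResolver so that it compiles without resolver.jar; in the real program you would pass new CatalogResolver(). The class and method names (CatalogParseSketch, setCatalogFilesProperty, parse) are hypothetical. It assumes that xml.catalog.files takes a semicolon-separated list and that the property should be set before the CatalogResolver is constructed. The strict ErrorHandler is an addition, so that validation errors surface as exceptions instead of only being printed.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical sketch; names are illustrative only.
class CatalogParseSketch {

    /** Joins catalog file paths with ';' and stores them in xml.catalog.files. */
    static String setCatalogFilesProperty(List<String> catalogFiles) {
        String value = String.join(";", catalogFiles);
        System.setProperty("xml.catalog.files", value);
        return value;
    }

    /** Parses inFile, routing external entity lookups through the given resolver. */
    static Document parse(File inFile, boolean validate, EntityResolver resolver)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validate);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver); // must happen before parse()
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                System.err.println("Warning: " + e.getMessage());
            }
            @Override public void error(SAXParseException e) throws SAXParseException {
                throw e; // validation errors arrive here
            }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                throw e;
            }
        });
        return builder.parse(inFile);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: <catalog files separated by ';'> <xml file>");
            return;
        }
        setCatalogFilesProperty(Arrays.asList(args[0].split(";")));
        // With resolver.jar on the classpath, replace null with:
        //   new org.apache.xml.resolver.tools.CatalogResolver()
        EntityResolver resolver = null; // null falls back to the parser's default lookup
        Document doc = parse(new File(args[1]), true, resolver);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Keeping the resolver as a parameter also fits the "more than one resolver" idea from the question: any EntityResolver implementation can be swapped in without touching the parsing code.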