
Delphi - Can TXMLDocument be configured to ignore incorrect DTD entities?


I'm writing Delphi code with RAD Studio XE7. In one of my projects, I need to parse several SVG files to draw their content on the screen. For that I use the TXMLDocument parser.

One of my source SVG files contains this XML data:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 17.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 0)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Calque_1" xmlns:x="&ns_extend;" xmlns:i="&ns_ai;" xmlns:graph="&ns_graphs;"
 xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" width="32px" height="32px"
 viewBox="0 0 32 32" enable-background="new 0 0 32 32" xml:space="preserve">
<metadata>
    <sfw  xmlns="&ns_sfw;">
        <slices></slices>
        <sliceSourceBounds  height="21.334" width="32" bottomLeftOrigin="true" y="1.833" x="-4.501"></sliceSourceBounds>
    </sfw>
</metadata>
<path fill="#29ABE2" d="M4,8h24v13.333h2.667v-16H1.334v16h2.667L4,8L4,8z M21.333,22.667c-0.256,0.536-1.527,0.967-2.667,1.181V24
h-5.333v-0.152c-1.14-0.215-2.411-0.645-2.667-1.181H-0.001V24c0,1.467,4,2.667,4,2.667h24c0,0,4-1.2,4-2.667v-1.333H21.333
L21.333,22.667z M26.667,25.333h-1.333V24h1.333V25.333z"/>
</svg>

I know that the content of the above XML is incomplete, and that a well-formed SVG should contain this XML data instead:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 17.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 0)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
    <!ENTITY ns_extend "http://ns.adobe.com/Extensibility/1.0/">
    <!ENTITY ns_ai "http://ns.adobe.com/AdobeIllustrator/10.0/">
    <!ENTITY ns_graphs "http://ns.adobe.com/Graphs/1.0/">
    <!ENTITY ns_vars "http://ns.adobe.com/Variables/1.0/">
    <!ENTITY ns_imrep "http://ns.adobe.com/ImageReplacement/1.0/">
    <!ENTITY ns_sfw "http://ns.adobe.com/SaveForWeb/1.0/">
    <!ENTITY ns_custom "http://ns.adobe.com/GenericCustomNamespace/1.0/">
    <!ENTITY ns_adobe_xpath "http://ns.adobe.com/XPath/1.0/">
]>
<svg version="1.1" id="Calque_1" xmlns:x="&ns_extend;" xmlns:i="&ns_ai;" xmlns:graph="&ns_graphs;"
 xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" width="32px" height="32px"
 viewBox="0 0 32 32" enable-background="new 0 0 32 32" xml:space="preserve">
<metadata>
    <sfw  xmlns="&ns_sfw;">
        <slices></slices>
        <sliceSourceBounds  height="21.334" width="32" bottomLeftOrigin="true" y="1.833" x="-4.501"></sliceSourceBounds>
    </sfw>
</metadata>
<path fill="#29ABE2" d="M4,8h24v13.333h2.667v-16H1.334v16h2.667L4,8L4,8z M21.333,22.667c-0.256,0.536-1.527,0.967-2.667,1.181V24
h-5.333v-0.152c-1.14-0.215-2.411-0.645-2.667-1.181H-0.001V24c0,1.467,4,2.667,4,2.667h24c0,0,4-1.2,4-2.667v-1.333H21.333
L21.333,22.667z M26.667,25.333h-1.333V24h1.333V25.333z"/>
</svg>

In my case, though, the DTD entities are irrelevant (I do nothing with them), and only the part starting at the svg tag interests me. Yet if I try to load such an XML file, the TXMLDocument parser raises a "Reference to undefined entity 'ns_extend'" exception and refuses to load the SVG.

So my question is: is there a way to tell the TXMLDocument parser to simply ignore DTD entities that are missing or corrupted, and to keep reading the document silently? Or is the only option to pre-process the XML, detecting and removing such corruptions?

(Note: I want to avoid pre-processing if possible. The SVGs may come from anywhere, some of them may contain minor or severe corruptions, and I want as many of them as possible to be handled in the most generic way. Adding special rules for every possible special case is a painful approach. I would greatly prefer that the TXMLDocument parser be able to ignore this kind of error.)


Solution

  • With TXMLDocument there is no way to ignore the DOCTYPE. The only option is to edit the XML file before parsing it with TXMLDocument and manually remove the following declaration from it:

    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
        <!ENTITY ns_extend "http://ns.adobe.com/Extensibility/1.0/">
        <!ENTITY ns_ai "http://ns.adobe.com/AdobeIllustrator/10.0/">
        <!ENTITY ns_graphs "http://ns.adobe.com/Graphs/1.0/">
        <!ENTITY ns_vars "http://ns.adobe.com/Variables/1.0/">
        <!ENTITY ns_imrep "http://ns.adobe.com/ImageReplacement/1.0/">
        <!ENTITY ns_sfw "http://ns.adobe.com/SaveForWeb/1.0/">
        <!ENTITY ns_custom "http://ns.adobe.com/GenericCustomNamespace/1.0/">
        <!ENTITY ns_adobe_xpath "http://ns.adobe.com/XPath/1.0/">
    ]>
    

    However, there are other XML parsers whose interface closely mirrors TXMLDocument (same method names and property names, so there is no need to redo your code), that work about 100x faster than TXMLDocument, use much less memory (TXMLDocument is about the worst you can find), and ignore the DTD :)
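
If you do end up pre-processing, the manual edit described above can be automated. What follows is only a minimal sketch under stated assumptions, not a robust solution. Note that removing the DOCTYPE alone is not enough for the file in the question: the `&ns_extend;` style references stay undefined, so the sketch also replaces every named entity reference other than the five predefined ones (`amp`, `lt`, `gt`, `quot`, `apos`) with its bare name. The function name `SanitizeSvgSource`, the program name, and both regular expressions are hypothetical and only for illustration. The sketch assumes that the internal DTD subset contains no `]` character, and it does not skip comments or CDATA sections.

```delphi
program StripDtdSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.RegularExpressions;

// Hypothetical helper (not part of the RTL): returns a copy of Xml without
// its DOCTYPE declaration and without references to non-predefined entities.
function SanitizeSvgSource(const Xml: string): string;
const
  // <!DOCTYPE ...> with an optional internal subset [ ... ].
  // Assumption: the subset itself contains no ']' character.
  DoctypePattern = '<!DOCTYPE[^\[>]*(\[[^\]]*\])?\s*>';
  // A named entity reference that is not one of the five predefined ones.
  // Numeric references such as &#38; start with '#' and are left untouched.
  EntityPattern = '&(?!(?:amp|lt|gt|quot|apos);)([A-Za-z_][\w.\-]*);';
begin
  // Step 1: drop the DOCTYPE so the parser has no DTD to look at.
  Result := TRegEx.Replace(Xml, DoctypePattern, '');
  // Step 2: replace each remaining custom reference with its bare name, so
  // that attributes such as xmlns:x keep a non-empty placeholder value.
  Result := TRegEx.Replace(Result, EntityPattern, '$1');
end;

const
  // Shortened version of the SVG header from the question.
  Sample =
    '<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" ' +
    '"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">' +
    '<svg xmlns:x="&ns_extend;" xmlns="http://www.w3.org/2000/svg"/>';

begin
  Writeln(SanitizeSvgSource(Sample));
end.
```

The cleaned string can then be handed to the parser instead of the raw file content, for example through TXMLDocument.LoadFromXML. The placeholder is used rather than an empty string because an empty value in a prefixed namespace declaration such as xmlns:x is itself not allowed in namespace-aware XML 1.0. Since the question states that the Adobe namespaces are not used, losing their real URIs should not matter here, but the same substitution would be wrong for documents that actually rely on those entities.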