python, xml, lxml, dtd

XML uses an external DTD for validation, but the Python (lxml) parser cannot load the DTD from an HTTPS server


I have a problem that I am stuck on. There are probably several solutions, but I would like to know whether my approach can be made to work.

I have an XML file that uses one external DTD, which is declared in the XML DOCTYPE.

The XML files are parsed with Python (lxml), so they can be validated automatically against the DTDs declared in their DOCTYPE. The external DTD I use is reachable via an internet address, but that site redirects every request to HTTPS. For this reason the parser cannot load the external DTD.

Thanks to a friend of mine, I was able to use an old, unused website that still runs on plain HTTP. When the DTD is stored on that website, the parser can find and use it.

Now for my question: is it possible to use an external DTD with Python lxml when it is only accessible via an HTTPS server? Unfortunately, I cannot set up an area on the server that is served over HTTP.

I have already tried to fetch the external DTD via an HTTP URL, but the request gets redirected to HTTPS.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book PUBLIC "-//AA//Test//EN" "***">
<!-- <!DOCTYPE book PUBLIC "-//AA//Test//EN" "***"> -->
<book>
    <book-meta>
        <book-id pub-id-type="other">handbook</book-id>
        <book-title-group Id="1">
            <book-title name="Hallo">The NCBI Handbook</book-title>
        </book-title-group>
    </book-meta>
</book>

For completeness here is an example DTD.

<!ELEMENT book ANY>
<!ATTLIST book
      Release                       CDATA "v0.0.1"
>

<!ELEMENT book-meta ANY> <!-- # related objects: 0 -->
<!ATTLIST book-meta
       Value                        CDATA "Das ist eine Information"
>
<!ELEMENT book-id ANY> <!-- # related objects: 0 -->
<!ATTLIST book-id
       pub-id-type                      CDATA #REQUIRED
>
<!ELEMENT book-title-group ANY> <!-- # related objects: 0 -->
<!ATTLIST book-title-group
         Id                                         CDATA #IMPLIED 
>
<!ELEMENT book-title ANY> <!-- # related objects: 0 -->
<!ATTLIST book-title
      name CDATA #REQUIRED
>

To parse the XML files I use a Python script with the lxml library. The test program follows.

from lxml import etree

# Load the DTD named in the DOCTYPE, validate against it, and apply its attribute defaults.
myParser = etree.XMLParser(attribute_defaults=True, dtd_validation=True, load_dtd=True, no_network=False)
xmlFile = etree.parse("XML_DTDValidation.xml", parser=myParser)
xmlFile.xinclude()
xmlFile.write("XML_DTDValidation_out.xml", method="xml", xml_declaration=True, encoding="utf-8", pretty_print=True)

I hope I have summarized my problem clearly and that someone can help.


Solution

  • This page describes some ways to work around this.

    You can either:

    • set up an XML catalog (which you could use to store the DTD somewhere local)
    • create your own resolver class which either redirects the URL, or retrieves the DTD from somewhere else.
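The second option could look roughly like the sketch below. As far as I know, the loader built into libxml2 (which lxml uses) only speaks plain HTTP, so the idea is to let Python's own `urllib` fetch the DTD over HTTPS and hand the bytes back to the parser. The names `HTTPSDTDResolver`, `make_parser` and the `fetch` parameter are my own inventions for illustration, not part of lxml; only `etree.Resolver`, `resolve()`, `resolve_string()` and `parser.resolvers.add()` come from the library.

```python
import urllib.request

from lxml import etree


class HTTPSDTDResolver(etree.Resolver):
    """Hypothetical resolver: fetch https:// resources with Python instead of libxml2."""

    def __init__(self, fetch=None):
        super().__init__()
        # `fetch` is an assumption of this sketch: any callable mapping URL -> bytes.
        self._fetch = fetch or self._fetch_with_urllib

    @staticmethod
    def _fetch_with_urllib(url):
        # urllib handles TLS (and follows redirects) on its own.
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.startswith("https://"):
            # Hand the downloaded DTD text to the parser.
            return self.resolve_string(self._fetch(system_url), context)
        # Returning None lets lxml fall back to its default loading.
        return None


def make_parser(fetch=None):
    """Build a validating parser (same options as in the question) with the resolver attached."""
    parser = etree.XMLParser(
        attribute_defaults=True,
        dtd_validation=True,
        load_dtd=True,
        no_network=False,
    )
    parser.resolvers.add(HTTPSDTDResolver(fetch))
    return parser
```

With this in place, the call from the question would become `etree.parse("XML_DTDValidation.xml", parser=make_parser())`, with the DOCTYPE pointing at the HTTPS address. If you would rather keep a local copy of the DTD, `resolve()` can return `self.resolve_filename(path, context)` instead of downloading anything.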