
How can I use defusedxml with sax in Python3?


I've built an XML parser in Python3 that uses SAX to extract useful information out of a long (potentially streaming) file; I'll put what I believe to be the relevant parts of my existing code down at the bottom of this post. I'm testing my parser on PubMed's XML data, which presumably is safe--but the parser may get used on other XML data (with suitable modifications to the tags it looks for), and that XML may not be safe.

It looks like I should be using the defusedxml library for safety. The description is that this is a "monkey patch", which IIUC means that where the defusedxml.sax library doesn't provide functionality, I can (safely, I hope!) use the regular xml.sax library. An example of the need for this is my element handler, which has to be defined to use the xml.sax library, not the defusedxml.sax library, since the latter doesn't supply a 'handler' class that I can subclass:

class ElementHandler(xml.sax.handler.ContentHandler):

because the defusedxml.sax library does not provide a 'handler'. On the other hand, the defusedxml.sax library does provide a definition of make_parser(), which I use.

But when I try to run my code (which works fine, if unsafely, when I just use the standard xml.sax library), I get an exception:

    raise ExternalReferenceForbidden(context, base, sysid, pubid)
defusedxml.common.ExternalReferenceForbidden: ExternalReferenceForbidden(system_id='http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd', public_id=None)

which evidently happens when my parser reads the second line of the PubMed XML file, to wit:

<!DOCTYPE PubmedArticleSet SYSTEM "http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">

Ok, so how to fix this? Intuitively, I want it to ignore this DOCTYPE declaration. But how?

The documentation says: "All functions and parser classes accept three additional keyword arguments. They return either the same objects as the original functions or compatible subclasses."

   forbid_dtd (default: False)
   disallow XML with a <!DOCTYPE> processing instruction and 
   raise a DTDForbidden exception when a DTD processing instruction 
   is found.

But I have two problems: 1) I don't want it to raise an exception, I just want it to ignore the declaration; and 2) I can't figure out where to put this keyword. If I put it in my call to make_parser():

   defusedxml.sax.make_parser(forbid_dtd=True)

I get

   TypeError: make_parser() got an unexpected keyword argument 'forbid_dtd'

and similarly everywhere else I've tried putting it.

I've looked for sample code, but I haven't found anything useful, nor any questions here that address the issue. There's this, but it's a kluge--re-write the incoming XML files without the DOCTYPE declaration, and then (I suppose) parse the new file with SAX. Not very practical when dealing with large XML documents.

So my question is: How do I build a SAX parser using the defusedxml library, and tell it to ignore DOCTYPE declarations?

---------excerpts from my code follow-----------

def ProcessXMLFile(<some parameters here>):
    SAXParser = defusedxml.sax.make_parser()
    SAXParser.setContentHandler(ElementHandler(<my startup parameters here>))
    ContentHandler = SAXParser.getContentHandler()
    Input = xml.sax.InputSource()
    Input.setCharacterStream(strXMLFile)
    Input.setEncoding('utf-8')
    SAXParser.parse(Input.getCharacterStream())

class ElementHandler(xml.sax.handler.ContentHandler):
    <__init__(), startElement(), endElement() etc. here>

Solution

  • make_parser() takes no keyword arguments; the defusedxml options are attributes on the parser object it returns. You would set forbid_dtd with:

    parser = defusedxml.sax.make_parser()
    parser.forbid_dtd = True

    But as you suspect, that would not do what you want: it makes the parser raise DTDForbidden on any DOCTYPE declaration. What you want here is to disable the forbid_external setting, which is the one raising ExternalReferenceForbidden:

    parser = defusedxml.sax.make_parser()
    parser.forbid_external = False
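For completeness, here is a minimal end-to-end sketch of that approach, not a definitive implementation. The handler class, the parse_titles() helper, the sample document, and the ArticleTitle element name are all made up for illustration; only the DOCTYPE line comes from the question. Once forbid_external is off, the external DTD reference is handled by the standard xml.sax expat reader, so the sketch also switches off feature_external_ges explicitly. As far as I know that is already the default since Python 3.7.1, but setting it keeps older interpreters from trying to fetch the DTD. The other defusedxml checks (forbid_entities) are left at their defaults.

```python
import io
import xml.sax.handler

import defusedxml.sax  # third-party: pip install defusedxml


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler: collects the text of every <ArticleTitle>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not inside an <ArticleTitle>"

    def startElement(self, name, attrs):
        if name == "ArticleTitle":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per text node.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "ArticleTitle" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def parse_titles(stream):
    """Parse a file-like object and return the collected titles."""
    parser = defusedxml.sax.make_parser()
    # Stop defusedxml from raising ExternalReferenceForbidden on the DOCTYPE.
    parser.forbid_external = False
    # Tell the underlying expat reader not to resolve external entities,
    # so the DTD URL is skipped rather than downloaded.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    handler = TitleHandler()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.titles


# Illustrative document; only the DOCTYPE line is taken from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet SYSTEM "http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
  <PubmedArticle><ArticleTitle>First title</ArticleTitle></PubmedArticle>
  <PubmedArticle><ArticleTitle>Second title</ArticleTitle></PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(parse_titles(io.BytesIO(SAMPLE)))
```

Because parse() accepts any file-like object, the same helper should work on an open file or a streaming response without rewriting the input to strip the DOCTYPE line.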