How can I resolve an external unparsed entity when parsing with lxml?
Here is my code example:
import io
from lxml import etree
content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""
parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))
Note: I'm using lxml >= 3.4
Currently, I get the following result:
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg" >
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
Here, the ref1 entity isn't resolved to "python-logo-small.jpg". I expected to have <sample src="python-logo-small.jpg"/>.
Is there something wrong?
I also tried:
parser = etree.XMLParser(dtd_validation=True, resolve_entities=True, load_dtd=True)
But I get the same result.
Alternatively, I'd like to resolve the entities myself. To do that, I tried to list the entities this way:
for entity in doc.docinfo.internalDTD.iterentities():
    msg_fmt = "{entity.name!r}, {entity.content!r}, {entity.orig!r}"
    print(msg_fmt.format(entity=entity))
But I only get the entity's name and the notation's name, not the entity's definition:
'ref1', 'jpeg', None
How can I access the entity's definition?
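OK, the parser cannot "resolve" external unparsed entities: an attribute of type ENTITY holds the entity's name, and the parser reports the declaration to the application instead of substituting it. We can list the declarations, though, for example with xml.sax: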
import io
import xml.sax
content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""
class MyDTDHandler(xml.sax.handler.DTDHandler):
    def __init__(self):
        pass

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        print(dict(name=name, publicId=publicId, systemId=systemId, ndata=ndata))
        xml.sax.handler.DTDHandler.unparsedEntityDecl(self, name, publicId, systemId, ndata)

parser = xml.sax.make_parser()
parser.setDTDHandler(MyDTDHandler())
parser.parse(io.BytesIO(content))
The result is:
{'systemId': u'python-logo-small.jpg', 'ndata': u'jpeg', 'publicId': None, 'name': u'ref1'}
So the work is done.
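To get the <sample src="python-logo-small.jpg"/> result the question asked for, the listed declarations can be used to rewrite the attribute by hand. The following is only a sketch under a few assumptions: the names EntityCollector, collect_unparsed_entities and resolve_entity_attribute are hypothetical helpers, the caller already knows which attribute is declared as ENTITY (here src), and the tree is built with the standard library's xml.etree.ElementTree rather than lxml so that the example has no third-party dependency.

```python
import io
import xml.sax
import xml.sax.handler
import xml.etree.ElementTree as ET

CONTENT = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""


class EntityCollector(xml.sax.handler.DTDHandler):
    """Hypothetical helper: remember each unparsed entity's system identifier."""

    def __init__(self):
        self.system_ids = {}

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        # Called once per <!ENTITY ... NDATA ...> declaration.
        self.system_ids[name] = systemId


def collect_unparsed_entities(data):
    """Return {entity name: system identifier} for the DTD found in `data`."""
    collector = EntityCollector()
    parser = xml.sax.make_parser()
    parser.setDTDHandler(collector)
    parser.parse(io.BytesIO(data))
    return collector.system_ids


def resolve_entity_attribute(data, attribute):
    """Parse `data` and replace entity names found in `attribute`.

    Assumption: the caller knows `attribute` is declared with type ENTITY;
    the attribute type is not checked against the DTD here.
    """
    entities = collect_unparsed_entities(data)
    root = ET.fromstring(data)
    for element in root.iter():
        value = element.get(attribute)
        if value in entities:
            element.set(attribute, entities[value])
    return root
```

Calling resolve_entity_attribute(CONTENT, "src") returns the root element with src set to the system identifier. The document is parsed twice (once for the DTD, once for the tree), which keeps the sketch simple. With lxml, the same get/set calls on the elements of the parsed tree should work in place of the ElementTree step.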