I'm trying to read in an XML file that looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>
The part that causes the problem is the
Jos&eacute; A. Blakeley
part: the parser calls its character handler twice, once with "Jos" and once with " A. Blakeley". I understand this may be the correct behaviour if it doesn't know the eacute entity. However, that entity is defined in dblp.dtd, which I have. I don't seem to be able to convince expat to use this file, though. All I can do is
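import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
# tried with and without the following line
p.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
p.UseForeignDTD(True)
# dblp_file is the path to the XML file shown above; binary mode lets expat honour the declared encoding
f = open(dblp_file, "rb")
p.ParseFile(f)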
but expat still doesn't recognize my entity. Why is there no way to tell expat which DTD to use? I've tried
What am I missing? Thx.
As I understand it, if you're using pyexpat directly, then you have to provide your own ExternalEntityRefHandler to fetch the external DTD and feed it to expat. See e.g. xml.sax.expatreader for example code (method external_entity_ref, line 374 in Python 2.6).
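A minimal sketch of that approach, loosely modelled on what expatreader does. The names parse_with_dtd and load_external are made up for illustration, and the sketch assumes the DTD can be read fully into memory as bytes:

```python
import xml.parsers.expat as expat

def parse_with_dtd(xml_bytes, load_external):
    """Parse xml_bytes and return the concatenated character data.

    load_external is a hypothetical callback: it receives the system
    identifier from the DOCTYPE (e.g. "dblp.dtd") and returns the DTD as bytes.
    """
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to process the external DTD subset.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def external_entity_ref(context, base, system_id, public_id):
        # Create a child parser for the external entity and feed it the DTD.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(load_external(system_id), True)
        return 1  # non-zero tells expat to carry on

    parser.ExternalEntityRefHandler = external_entity_ref
    # The handler may be called several times per text node, so collect the pieces.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)
```

For a DTD on disk, load_external could be something like `lambda system_id: open(system_id, "rb").read()`, which assumes the system identifier is a path relative to the working directory. Note that the character handler can still be called in several pieces; the entity just no longer goes missing between them.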
It would probably be better to use a higher-level interface such as SAX (via expatreader) if you can.
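A rough sketch of the SAX route, for comparison. It assumes a recent Python 3, where external entities are off by default and have to be enabled through feature_external_ges; the TextCollector and InMemoryResolver classes and the sax_text function are hypothetical helpers, and the resolver is only there so the DTD can come from memory instead of a file:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

class TextCollector(ContentHandler):
    """Collects all character data reported by the parser."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

class InMemoryResolver(EntityResolver):
    """Serves the same DTD bytes for every external entity (illustration only)."""
    def __init__(self, dtd_bytes):
        self.dtd_bytes = dtd_bytes

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.dtd_bytes))
        return source

def sax_text(xml_bytes, dtd_bytes):
    reader = xml.sax.make_parser()
    # Without this, newer Python versions skip the external DTD entirely.
    reader.setFeature(feature_external_ges, True)
    collector = TextCollector()
    reader.setContentHandler(collector)
    reader.setEntityResolver(InMemoryResolver(dtd_bytes))
    reader.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)
```

If dblp.dtd sits next to the XML file, the custom resolver can probably be dropped and the file path passed to reader.parse directly, since the default resolver hands the system identifier back to the reader to open.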