xml qt qt5 dtd qtxml

Manually resolve internal XML entities


I am using QXmlSimpleReader to parse an XML file with internally defined entities in it, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ELEMENT root (#PCDATA)>
<!ENTITY ent "some internally defined entity">
]>
<root>
text &ent; text
</root>

I am handling the document with a QXmlDefaultHandler subclass and the most I can do about internal entities is to have their usage reported.

By default, all internally defined entities (&ent; in the example above) are automatically replaced with their defined text. How can I change this behavior so that I can specify the string each entity is replaced with? I am also fine with switching to another Qt XML reader if that is required to make it work.


Solution

  • I found one way to do it, although it is more of a hack than a proper solution, since it doesn't stop Qt from replacing the entity references with their resolved text. It is a workaround in which that resolved text is simply ignored.

    First, make the QXmlSimpleReader report entities by setting the appropriate feature and handle content and lexical info:

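    // handler is assumed to point to an instance of the QXmlDefaultHandler subclass.
    QXmlSimpleReader xmlReader;
    // Ask the reader to call startEntity()/endEntity() around each entity reference.
    xmlReader.setFeature("http://qt-project.org/xml/features/report-start-end-entity", true);
    // The same object receives both content events and lexical (entity) events.
    xmlReader.setContentHandler(handler);
    xmlReader.setLexicalHandler(handler);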
    

    Next, in the handler, override bool QXmlLexicalHandler::startEntity(const QString &name) and bool QXmlLexicalHandler::endEntity(const QString &name), and keep a flag that records whether the reader is currently inside an entity. While it is, ignore the input passed to bool QXmlContentHandler::characters(const QString &ch) and emit your own replacement from startEntity or endEntity instead.
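
The state tracking described above can be sketched without Qt, so that the example compiles on its own. This is only a model under stated assumptions: EntityAwareHandler, Resolver, and replayExample are hypothetical names, std::string stands in for QString, and the callback order (startEntity, then characters with the resolved text, then endEntity) is taken from the description above rather than verified against the reader. In a real handler the three methods would be overrides in the QXmlDefaultHandler subclass.

```cpp
#include <functional>
#include <string>
#include <utility>

// Qt-free model of the handler logic (hypothetical class, not part of Qt).
// In real code: derive from QXmlDefaultHandler and take const QString &.
class EntityAwareHandler {
public:
    // Hypothetical callback: maps an entity name to the text to emit for it.
    using Resolver = std::function<std::string(const std::string &)>;

    explicit EntityAwareHandler(Resolver resolver)
        : resolver_(std::move(resolver)) {}

    // Models QXmlLexicalHandler::startEntity: emit the custom replacement
    // once, when the outermost entity begins.
    bool startEntity(const std::string &name) {
        if (depth_ == 0)
            output_ += resolver_(name);
        ++depth_;
        return true;
    }

    // Models QXmlLexicalHandler::endEntity: leave one level of entity nesting.
    bool endEntity(const std::string & /*name*/) {
        if (depth_ > 0)
            --depth_;
        return true;
    }

    // Models QXmlContentHandler::characters: drop text that the parser
    // produced by resolving an entity, keep everything else.
    bool characters(const std::string &ch) {
        if (depth_ == 0)
            output_ += ch;
        return true;
    }

    const std::string &text() const { return output_; }

private:
    Resolver resolver_;
    std::string output_;
    int depth_ = 0;  // > 0 while inside an entity
};

// Replays the callbacks assumed for "text &ent; text" from the question
// (surrounding newlines omitted) and returns the accumulated text.
std::string replayExample() {
    EntityAwareHandler handler(
        [](const std::string &name) { return "[" + name + "]"; });
    handler.characters("text ");
    handler.startEntity("ent");
    handler.characters("some internally defined entity");  // ignored
    handler.endEntity("ent");
    handler.characters(" text");
    return handler.text();  // "text [ent] text"
}
```

A counter is used instead of a plain bool on the assumption that an entity whose definition references another entity may be reported as nested startEntity/endEntity pairs; if that never happens, a bool behaves identically.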