
TinyXML - any way to skip problematic DOCTYPE tag?


I am using TinyXML2 to parse an XML document that looks something like this:

<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE comp PUBLIC "-//JWS//DTD xyz//EN" "file:/documentum/xyz.dtd"
[<!ENTITY subject SYSTEM "dctm://he/abc">
]>
<comp>
...
</comp>

Unfortunately, according to http://www.grinninglizard.com/tinyxmldocs/, TinyXML doesn't seem to support parsing DOCTYPE tags such as the one in the sample above. I am not interested in the DTD itself and only want to parse the rest of the XML, starting with the <comp> tag. What is the recommended way to achieve this? I tried retrieving the XML subtree rooted at <comp> (using document.FirstChildElement("comp")), but this failed, possibly because TinyXML cannot continue parsing past the <!ENTITY tag, which it seems to treat as an error. Can this be done with TinyXML itself, preferably without a preprocessing step that removes the <!DOCTYPE ..> with a regular expression before invoking TinyXML?


Solution

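  • You can first load the entire file into a std::string, skip the unsupported statements, and then parse the remainder of the document. The skipping is line-based, so it assumes that the first ordinary element (here <comp>) starts on its own line, as it does in the sample. Like this: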

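    // Requires <fstream>, <string>, <vector> and the TinyXML header (tinyxml.h)

    // Open the file positioned at the end so tellg() reports its size
    std::ifstream ifs("filename.xml", std::ios::in | std::ios::binary | std::ios::ate);
    std::streamoff fsize = ifs.tellg(); // -1 if the file could not be opened
    ifs.seekg(0, std::ios::beg);
    std::vector<char> bytes(fsize > 0 ? static_cast<std::size_t>(fsize) : 0);
    ifs.read(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    
    // Create string from vector
    std::string xml_str(bytes.begin(), bytes.end());
    
    // Skip unsupported statements
    std::size_t pos = 0;
    while (true) {
        pos = xml_str.find('<', pos);
        if (pos == std::string::npos || pos + 1 >= xml_str.size())
            break; // no further tag to inspect
        if (xml_str[pos + 1] == '?' || // <?xml...
            xml_str[pos + 1] == '!') { // <!DOCTYPE... or [<!ENTITY...
            // Skip the rest of this line
            pos = xml_str.find('\n', pos);
        } else
            break; // first ordinary element, e.g. <comp>
    }
    // Keep everything from the first ordinary element onward
    xml_str = (pos == std::string::npos) ? std::string() : xml_str.substr(pos);
    
    // Parse document as usual
    TiXmlDocument doc;
    doc.Parse(xml_str.c_str());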
    

    Additional note: if the XML file is very large, it's better to use a memory-mapped file instead of loading the entire file into memory. But that's another question entirely.
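
    The line-based skipping in this answer depends on how the file happens to be laid out. As a rough alternative sketch, the DOCTYPE declaration could be cut out by scanning for its closing > while ignoring any > inside the bracketed internal subset or inside quoted strings. The helper name strip_doctype is hypothetical and not part of TinyXML. It does not handle comments inside the internal subset that contain quotes or brackets, and it leaves the <?xml ...?> declaration in place on the assumption that TinyXML copes with it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of TinyXML): returns a copy of `xml` with the
// first <!DOCTYPE ...> declaration removed, including a [ ... ] internal subset.
// Simplification: comments inside the subset are not treated specially.
std::string strip_doctype(const std::string& xml) {
    const std::string marker = "<!DOCTYPE";
    const std::size_t start = xml.find(marker);
    if (start == std::string::npos)
        return xml; // nothing to strip

    bool in_subset = false; // true while between '[' and ']'
    char quote = 0;         // the active quote character, or 0 if outside quotes
    for (std::size_t i = start + marker.size(); i < xml.size(); ++i) {
        const char c = xml[i];
        if (quote != 0) {
            if (c == quote)
                quote = 0; // closing quote
        } else if (c == '"' || c == '\'') {
            quote = c; // opening quote
        } else if (c == '[') {
            in_subset = true;
        } else if (c == ']') {
            in_subset = false;
        } else if (c == '>' && !in_subset) {
            // End of the DOCTYPE declaration: join the text before and after it
            return xml.substr(0, start) + xml.substr(i + 1);
        }
    }
    return xml; // unterminated DOCTYPE: leave the input unchanged
}
```

    The returned string could then be passed to doc.Parse(...) in the same way as xml_str in the code above.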