Tags: c++, xml-parsing, xerces, xerces-c

XercesDOMParser fails if the attribute value contains ampersand


I have a utility class for basic XML/HTML edit operations. It uses the Xerces-C++ library, version 3.2.2.

XMLPlatformUtils::Initialize();
m_xmlParser = new XercesDOMParser();
m_xmlParser->setValidationScheme(XercesDOMParser::Val_Always);
m_xmlParser->setDoNamespaces(false);
m_xmlParser->setDoSchema(false);
m_xmlParser->setLoadExternalDTD(false);

// string validText = "<a href=\"http://localhost/page=PDF\">text</a>";
string invalidText = "<a href=\"http://localhost/page=PDF&pageId=23\">text</a>";
MemBufInputSource buffer((const XMLByte*)invalidText.c_str(), invalidText.size(), "In-memory XML buffer");
m_xmlParser->parse(buffer);

xercesc::DOMDocument* xmlDoc = m_xmlParser->getDocument();
if (!xmlDoc) {
    m_logger->warn(__FILE__, __LINE__, -1, "Empty xml document");
    return;
}
m_logger->debug("Getting the root element from input xml file");
m_rootElement = xmlDoc->getDocumentElement();
if (m_rootElement == nullptr) {
    m_logger->warn(__FILE__, __LINE__, -1, "Not a valid xml element");
}

With this code, m_rootElement is null. When I switch the input to validText, parsing works correctly, and it also works when I replace & with &amp; in the input. But both of these changes break the hyperlink.


Solution

  • XML requires an ampersand in an attribute value to be escaped as &amp;, so Xerces, as an XML parser, is doing the right thing when it rejects the unescaped ampersand. Escaping should not break the link, though: any HTML user agent that follows the link works on the parsed attribute value with a plain ampersand, e.g. http://localhost/page=PDF&pageId=23, and not on the lexical XML. It is not clear what software you use to follow the hyperlink.

    In the DOM, a call such as getAttribute("href") should return the unescaped value http://localhost/page=PDF&pageId=23. That method, or a similar one, is what I would expect an HTML user agent to use when it reads an href attribute value, e.g. to follow a link.
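
To make the escaping step concrete, here is a minimal sketch of how the markup could be built before it is handed to the parser. The helpers escapeXml and buildAnchor are hypothetical names introduced for illustration; they are not part of Xerces-C++, and the sketch assumes the link text is plain character data rather than nested markup.

```cpp
#include <string>

// Hypothetical helper (not a Xerces-C++ API): escapes the characters that
// may not appear literally in XML character data or in a double-quoted
// attribute value.
std::string escapeXml(const std::string& value) {
    std::string out;
    out.reserve(value.size());
    for (char c : value) {
        switch (c) {
            case '&': out += "&amp;";  break;  // must always be escaped
            case '<': out += "&lt;";   break;  // must always be escaped
            case '>': out += "&gt;";   break;  // escaped for safety
            case '"': out += "&quot;"; break;  // needed inside "..." attributes
            default:  out += c;        break;
        }
    }
    return out;
}

// Hypothetical helper: builds the anchor element from the question, escaping
// the URL and the link text so that the result is well-formed XML.
// Assumption: `text` is plain text, not markup.
std::string buildAnchor(const std::string& url, const std::string& text) {
    return "<a href=\"" + escapeXml(url) + "\">" + escapeXml(text) + "</a>";
}
```

Calling buildAnchor("http://localhost/page=PDF&pageId=23", "text") produces markup with &amp; in the href, which an XML parser should accept. As described above, reading the attribute back from the DOM (getAttribute takes the attribute name as a const XMLCh*, for example one obtained from XMLString::transcode) should then give the value with a plain ampersand again.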