How do I stop libxml2 from decoding '&'-words in attribute values?


I have an XML file that I need to parse and process, so I'm using libxml2.

The XML file looks something like this:

test.xml

<root>
     <tag attr1="VALUE_1 &quot;" attr2="VALUE_2 &#xA; VALUE_3" />
</root>

I want to get the attribute contents, but libxml2 seems to decode the '&'-words (I don't know what they are called).

This is the code I use:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
        LIBXML_TEST_VERSION

        xmlDoc *doc = xmlReadFile("test.xml", NULL, XML_PARSE_IGNORE_ENC);
        if (doc == NULL)
                return 1;

        xmlNode *root = xmlDocGetRootElement(doc);

        /* Walk the direct children of <root>. */
        for (xmlNode *node = root->children; node != NULL; node = node->next) {
                if (node->type != XML_ELEMENT_NODE)
                        continue;

                /* An attribute's value is stored as a list of child nodes. */
                for (xmlAttr *attr = node->properties; attr != NULL; attr = attr->next) {
                        for (xmlNode *child = attr->children; child != NULL; child = child->next) {
                                if (child->type == XML_TEXT_NODE ||
                                    child->type == XML_CDATA_SECTION_NODE)
                                        printf("%s\n", child->content);
                        }
                }
        }

        xmlFreeDoc(doc);
        return 0;
}

So basically I want to print the attribute values, but they seem to be decoded during parsing (I guess). When I run this code, I see the following output:

VALUE_1 "

VALUE_2 
 VALUE_3

As you can see, it translated the '&'-words. How can I tell libxml2 not to do that and give me the literal text values instead?


Solution

You simply can't. libxml2 will always decode numeric character references like &#xA; and predefined entities like &quot;. This is intentional: &#65; and A, for example, are semantically equivalent, so the parser reports them identically. If you really need to tell them apart, you're probably doing something wrong elsewhere in your XML pipeline. If you want a literal &#xA; in an attribute value, you have to encode it as &amp;#xA; in the source document.

Note that expansion can be controlled for other, user-defined entities via the XML_PARSE_NOENT parser flag, but this won't affect numeric character references.