I have an XML file which should be parsed and processed. For that reason I'm using libxml2
.
The xml file I have looks something like this:
test.xml
<root>
<tag attr1="VALUE_1 "" attr2="VALUE_2 
 VALUE_3" />
</root>
And I want to get the attribute contents. BUT the libxml2 seems to encode the '&'-words (don't know how to call them).
The code I use is the following one:
LIBXML_TEST_VERSION
xmlDoc *doc;
doc = xmlReadFile("test.xml", NULL, XML_PARSE_IGNORE_ENC);
xmlNode *root;
root = xmlDocGetRootElement(doc);
xmlNode *node;
node = root->children;
while (node != NULL) {
if (node->type == XML_ELEMENT_NODE) {
xmlAttr *attr;
attr = node->properties;
while (attr != NULL) {
xmlNode *child;
child = attr->children;
while (child != NULL) {
if (child->type == XML_TEXT_NODE ||
child->type == XML_CDATA_SECTION_NODE)
printf("%s\n", child->content);
child = child->next;
}
attr = attr->next;
}
}
node = node->next;
}
So basically I want to print the attribute values, BUT they are being parsed with a formatting (I guess). When I run this code than I see following output:
VALUE_1 "
VALUE_2
VALUE_3
As you can see it translated the '&'-words. How can I hint the libxml2 to not do that and give me the literal text values.
You simply can't. libxml2 will always decode numeric character references like 

and predefined entities like "
. But A
and A
, for example, are semantically equivalent. If you really need to tell them apart, you're probably doing something wrong elsewhere in your XML pipeline. If you want a literal 

in an attribute value, you have to encode it as &#xA;
.
Note that the expansion can be controlled for other, user-defined entities via the XML_PARSE_NOENT
parser flag, but this won't affect numeric character references.