I'm reading an XML document which is encoded in ISO-8859-1. This encoding is also declared in the document:
<?xml version="1.0" encoding="ISO-8859-1"?>
When I read the XML elements, I get the data in UTF-8 encoding, but I need ISO-8859-1 for further processing.
My code to read the file looks like this:
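xmlTextReaderPtr reader;
reader = xmlReaderForFile(sessionFileName, "iso-8859-1", 0);
if (reader != NULL)
{
    ret = xmlTextReaderRead(reader);
    while (ret == 1)
    {
        //only inspect start of elements
        if (xmlTextReaderNodeType(reader) != XML_READER_TYPE_ELEMENT)
        {
            ret = xmlTextReaderRead(reader);
            continue;
        }
        //getting node name (owned by the reader, must not be freed)
        elem_name = xmlTextReaderConstName(reader);
        //getting content of element (text or cdata)
        xmlChar *elem_value = xmlTextReaderReadString(reader);
        //... process elem_name / elem_value here ...
        //the string returned by xmlTextReaderReadString must be freed by the caller
        if (elem_value)
            xmlFree(elem_value);
        //advance to the next node, otherwise the loop never terminates
        ret = xmlTextReaderRead(reader);
    }
    //release the reader once parsing is done
    xmlFreeTextReader(reader);
}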
As I understand http://xmlsoft.org/encoding.html, libxml2 stores all data internally in UTF-8, so elem_value is also UTF-8. How can I get elem_value in ISO-8859-1? Do I have to convert it manually?
This is my attempt:
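unsigned char *conv_value = NULL;
if (elem_value)
{
    int total = xmlStrlen(elem_value);
    int in_size = total;  /* in: bytes available, out: bytes consumed */
    int out_size = total; /* in: buffer capacity, out: bytes written */
    /* ISO-8859-1 output is never longer than the UTF-8 input */
    conv_value = (unsigned char *)malloc((size_t)out_size + 1);
    /* a negative return value signals failure; in_size != total means
       that not all of the input was consumed */
    if (conv_value == NULL ||
        UTF8Toisolat1(conv_value, &out_size, elem_value, &in_size) < 0 ||
        in_size != total)
    {
        //error during conversion
        free(conv_value);
        //take original value (owned by libxml2: release with xmlFree, not free)
        conv_value = elem_value;
        TRACE("error while converting, take utf-8 value");
    }
    else
    {
        conv_value[out_size] = 0; /* null-terminate conv_value */
    }
}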
You're correct: you will need to convert the value manually from UTF-8 to ISO-8859-1 after you get it out of the XML API. I get that this effectively "doubles the work", since the data is converted twice just to end up back in its original encoding, but converting to UTF-8 is an integral part of libxml2's parsing process, and there's no way to tell it not to.
The plus side is that if the content you're consuming suddenly changes to UTF-8, UTF-16 or any other character set, your "get it from libxml2 and convert to ISO-8859-1" code will still work properly, as long as every character in the content can be represented in ISO-8859-1.
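To make the manual conversion step concrete, here is a minimal, dependency-free sketch in C of what such a conversion does. The function name utf8_to_latin1 is hypothetical and not part of libxml2; in real code, UTF8Toisolat1 from the question does this job. The sketch assumes a NUL-terminated UTF-8 input and rejects anything that does not fit into ISO-8859-1.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libxml2): converts a NUL-terminated
 * UTF-8 string to a newly allocated ISO-8859-1 string.
 * Returns NULL if the input is malformed, contains a character above
 * U+00FF, or the allocation fails. The caller releases the result
 * with free().
 */
unsigned char *utf8_to_latin1(const unsigned char *in)
{
    size_t len = strlen((const char *)in);
    /* ISO-8859-1 uses one byte per character, so the result is never
       longer than the UTF-8 input. */
    unsigned char *out = malloc(len + 1);
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;

    while (i < len) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII: identical in both encodings */
            out[o++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && (in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* malformed input or a character outside ISO-8859-1 */
            free(out);
            return NULL;
        }
    }
    out[o] = '\0';
    return out;
}
```

Because every ISO-8859-1 character takes one byte while non-ASCII characters take two bytes in UTF-8, the converted string is shorter than the input whenever it contains such characters. This is why comparing the number of bytes consumed with the number of bytes written is not a reliable error check; comparing the bytes consumed with the original input length is.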