LibXML xmlTextReaderReadString encoding


I'm reading an XML document that is encoded in iso-8859-1. The encoding is also declared in the document: <?xml version="1.0" encoding="ISO-8859-1"?>

When I read the XML elements, I get the data in utf-8 encoding, but I need iso-8859-1 for further processing.

My code to read the file looks like this:

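xmlTextReaderPtr reader;
int ret;

reader = xmlReaderForFile(sessionFileName, "iso-8859-1", 0);
if (reader != NULL)
{
    //advance to the next node on every iteration, so the loop always terminates
    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        //only inspect start of elements
        if (xmlTextReaderNodeType(reader) != XML_READER_TYPE_ELEMENT)
            continue;

        //getting node name (owned by the reader, must not be freed)
        const xmlChar *elem_name = xmlTextReaderConstName(reader);

        //getting content of element (text or cdata), returned as utf-8
        xmlChar *elem_value = xmlTextReaderReadString(reader);

        /* ... use elem_name and elem_value here ... */

        //the string returned by xmlTextReaderReadString must be released by the caller
        if (elem_value != NULL)
            xmlFree(elem_value);
    }

    xmlFreeTextReader(reader);
}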

As I understand http://xmlsoft.org/encoding.html, libxml2 stores all data internally in utf-8, so elem_value is also utf-8. How can I get elem_value in iso-8859-1? Do I have to convert it manually?

This would be my try:

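        unsigned char *conv_value = NULL;

        if (elem_value)
        {
            int total_in = xmlStrlen(elem_value);
            int in_size  = total_in;
            int out_size = total_in; /* iso-8859-1 output is never longer than the utf-8 input */

            conv_value = (unsigned char *)malloc((size_t)out_size + 1);

            //UTF8Toisolat1 returns a negative value on failure; afterwards in_size
            //holds the bytes consumed and out_size the bytes produced
            if (conv_value == NULL ||
                UTF8Toisolat1(conv_value, &out_size, elem_value, &in_size) < 0 ||
                in_size != total_in)
            {
                //error during conversion
                free(conv_value);

                //take original value (allocated by libxml2: release with xmlFree, not free)
                conv_value = elem_value;

                TRACE("error while converting, take utf-8 value");
            }
            else
            {
                conv_value[out_size] = 0; /* null terminating conv_value */
            }
        }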

Solution

  • You're correct: you will need to convert it manually from utf-8 to iso-8859-1 after you get it out of the XML API. I get that this effectively "doubles the work", since the data is converted twice just to end up back in the original encoding, but converting to UTF-8 is an integral part of libxml's parsing process, and there's no way to tell it not to.

    The plus side is that if the content you're consuming suddenly changes to UTF-8 or UTF-16 or any other character set, your "get it from libxml and convert to iso-8859-1" code will still work properly, as long as every character in the content can be represented in iso-8859-1.
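
    To make the manual conversion step concrete, here is a minimal, dependency-free sketch of what a utf-8 to iso-8859-1 conversion does. The function name utf8_to_latin1 is hypothetical and not part of libxml2; in real code libxml2's own UTF8Toisolat1 is the more natural choice. The sketch assumes a NUL-terminated input string and treats any character above U+00FF as an error, because iso-8859-1 cannot represent it.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a libxml2 function): converts a NUL-terminated
 * UTF-8 string into a newly malloc'ed, NUL-terminated ISO-8859-1 string.
 * Returns NULL if the input is malformed, contains a code point above
 * U+00FF, or the allocation fails. The caller releases the result with free().
 */
unsigned char *utf8_to_latin1(const unsigned char *in)
{
    size_t len = 0;
    while (in[len] != 0)
        len++;

    /* ISO-8859-1 output is never longer than the UTF-8 input. */
    unsigned char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII bytes are identical in both encodings. */
            out[o++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. Reading in[i + 1]
               is safe: at worst it is the terminating NUL, which fails
               the continuation-byte check. */
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Malformed UTF-8 or a character outside ISO-8859-1. */
            free(out);
            return NULL;
        }
    }
    out[o] = 0;
    return out;
}
```

    With the reader loop from the question, this could be called as utf8_to_latin1(elem_value). A NULL result signals that the value cannot be expressed in iso-8859-1, and the caller then decides whether to fall back to the utf-8 bytes or to report an error.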