c++ xml pdf pugixml

Read German text from XML and write to a PDF


I have an XML file (in UTF-8). I need to read a value from it into a std::string using the pugixml library. In this example I print the value to the console, but in my actual project I have to write it to a PDF (using the LibHaru library). My MWE follows:

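#include <iostream>
#include <string>
#include "pugiconfig.hpp"
#include "pugixml.hpp"

using namespace pugi;

int main()
{
    // FILEPATH is a placeholder for the path to the XML file.
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file(FILEPATH);
    if (!result)
    {
        // Stop early if the file could not be opened or parsed.
        std::cerr << "Failed to load XML: " << result.description() << std::endl;
        return 1;
    }

    xml_node root_node = doc.child("Report");
    xml_node SystemName_node = root_node.child("SystemName");

    // child_value() returns the node's text; in pugixml's default (char) mode this is UTF-8.
    std::string strSystemName = SystemName_node.child_value();

    std::cout << " The name of the system is: " << strSystemName << std::endl;

    return 0;
}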

I read the value into the std::string variable strSystemName using pugixml and then print it on screen (in my actual project, I write it to a PDF file). Problem: while debugging, I found that strange characters are read from the XML file (which is already in UTF-8). They appear both when I print the variable on screen and when I write it to the PDF.

IMPORTANT: Printing to the console is not that important. What matters is writing the value correctly to the PDF file, which also uses UTF-8 encoding. But I think that storing the value in a std::string is somehow causing the problem, and therefore the wrong value is passed to the PDF writer.

PS: I am using VS2010 which is without C++11.


Solution

  • The problem here is that std::cout just passes the UTF-8 bytes in the string through to the console. On Windows the console normally does not run in UTF-8 but in, for example, code page 1252, so the two bytes of a UTF-8 'ä' (0xC3 0xA4) are displayed as two characters ("Ã¤" in code page 1252).

    Your solution is either to switch the console to UTF-8 (see this answer), or to convert your UTF-8 string into a CP-1252 string. I think the conversion requires MultiByteToWideChar (specifying CP_UTF8) followed by WideCharToMultiByte (specifying code page 1252).

    To debug your actual problem (passing the UTF-8 strings read by pugixml on to the PDF writer), you need to look at the actual bytes in the strings and check that they are what you think they are.
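
One way to look at the actual bytes is to dump the string as hex, so that the console code page cannot distort what you see. The following is only a minimal sketch: toHex is a hypothetical helper name (it is not part of pugixml or LibHaru), and it is written without C++11 features so that it should also build with VS2010.

```cpp
#include <cstddef>
#include <string>

// Hypothetical debugging helper (not part of pugixml or LibHaru):
// returns the bytes of s as space-separated uppercase hex pairs.
std::string toHex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Go through unsigned char so that bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0)
            out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

In the MWE above you could print toHex(strSystemName) instead of the string itself. If the value really is UTF-8, an 'ä' shows up as C3 A4. A single E4 would suggest that the text is actually in a single-byte encoding such as CP-1252, and C3 83 C2 A4 would suggest that it has been encoded to UTF-8 twice.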