
How do I keep the output encoding as UTF-8 when reading and rewriting the same XML file with JDOM?


I have this code, which reads "prueba3.xml", modifies it, and writes it back to the same file. The file is UTF-8, but when I write it the encoding changes and strange characters appear. I have added format.setEncoding("UTF-8"), but the output is still wrong. Is it possible to force the output encoding to UTF-8 with JDOM?

Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<prueba>
    <reg id="576340">
         <dato cant="856" id="6" val="-1" num="" desc="ñápás" />
         <dato cant="680" id="1" val="-1" num="" desc="résd" />
         <dato cant="684" id="5" val="-1" num="" desc="..да и вообем" />
         <dato cant="1621" id="1" val="-1" num="" desc="hi" />
         <dato cant="1625" id="5" val="-1" num="" desc="Hola" />
   </reg>
</prueba>

This is the code:

public static void main(String[] args) throws FileNotFoundException, JDOMException, IOException
{
    // Create a SAXBuilder to parse the file
    File xml = new File("c:\\prueba3.xml");
    Document doc = (Document) new SAXBuilder().build(xml);

    Element raiz = doc.getRootElement();
    // Iterate over the children of the root element
    List articleRow = raiz.getChildren("reg");

    for (int i = 0; i < articleRow.size(); i++) {

        Element row = (Element) articleRow.get(i);
        List images = row.getChildren("dato");

         for (int j = 0; j < images.size(); j++) {

             Element row2 = (Element) images.get(j);
             String texto = row2.getAttributeValue("desc") ;
             String id = row2.getAttributeValue("id"); 

                   // Compare strings by content, not by reference
                   if (texto != null && !texto.isEmpty() && "1".equals(id)) {
                     row2.getAttribute("desc").setValue("Raúl");
                   }
        }

        Format format = Format.getRawFormat();
        format.setEncoding("UTF-8");
        XMLOutputter xmlOutput = new XMLOutputter(format);
        xmlOutput = new XMLOutputter(format);
        xmlOutput.output(doc, new FileWriter("c:\\prueba3.xml"));
    }

    System.out.println("fin");   
}

Output XML:

<?xml version="1.0" encoding="UTF-8"?>
<prueba>
  <reg id="576340">
       <dato cant="856" id="6" val="-1" num="" desc="񡰡s" /> 
       <dato cant="680" id="1" val="-1" num="" desc="Ra򬢠/>
       <dato cant="684" id="5" val="-1" num="" desc="..?? ? ??????" />
       <dato cant="1621" id="1" val="-1" num="" desc="Ra򬢠/>
       <dato cant="1625" id="5" val="-1" num="" desc="Hola" />
 </reg>
</prueba>

Greetings and thanks for your time.


Solution

  • This is a relatively common problem when using JDOM, especially in countries/regions with non-Latin alphabets. In some ways I regret keeping Writer-based output in JDOM at all.

    See the JavaDoc on XMLOutputter too: http://www.jdom.org/docs/apidocs/org/jdom2/output/XMLOutputter.html

    The issue is that FileWriter uses the system's default encoding to convert the characters written to it into the underlying bytes. JDOM cannot control that conversion, so the bytes in the file may not match the UTF-8 encoding declared in the XML header.

    If you change the line of code:

    xmlOutput.output(doc, new FileWriter("c:\\prueba3.xml"));
    

    to use an OutputStream instead of a Writer:

    try (OutputStream fos = new FileOutputStream("c:\\prueba3.xml")) {
        xmlOutput.output(doc, fos);
    }
    

    ... JDOM will treat the output as a byte stream and encode the characters itself, using the encoding set on the Format, so the system's default encoding won't interfere with the output.

    (P.S. There's no reason to assign the xmlOutput instance twice.)
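
The effect described above can be reproduced without JDOM. The following is a minimal sketch using only the JDK: it pushes the same text through a Writer bound to two different charsets and checks whether the resulting bytes still read back correctly as UTF-8. The class and method names (WriterEncodingDemo, writeThroughWriter, survivesAsUtf8) are made up for this illustration, and ISO-8859-1 merely stands in for "a system default that is not UTF-8"; the actual default on the asker's machine is not known.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingDemo {

    // Hypothetical helper: send text through a Writer bound to the given
    // charset and return the bytes that would end up in the file.
    public static byte[] writeThroughWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // True if the bytes, decoded as UTF-8 (what the XML header promises),
    // give back exactly the original text.
    public static boolean survivesAsUtf8(String text, Charset writerCharset) {
        byte[] bytes = writeThroughWriter(text, writerCharset);
        return new String(bytes, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        // \u00fa is the letter u with an acute accent, as in "Raul" with an accent.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<dato desc=\"Ra\u00fal\" />";

        // Writer whose charset disagrees with the declared encoding.
        System.out.println("ISO-8859-1 writer intact: "
                + survivesAsUtf8(xml, StandardCharsets.ISO_8859_1));

        // Writer whose charset matches the declared encoding.
        System.out.println("UTF-8 writer intact:      "
                + survivesAsUtf8(xml, StandardCharsets.UTF_8));
    }
}
```

The first check should report false and the second true: the header says UTF-8 in both cases, but only the second Writer actually produces UTF-8 bytes. If a Writer really is required with XMLOutputter, wrapping the file in something like new OutputStreamWriter(new FileOutputStream("c:\\prueba3.xml"), StandardCharsets.UTF_8) should play the same role as the matching Writer here, although passing the OutputStream directly, as in the answer, is the simpler route.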