
Unexpected behaviour when parsing XML with SAXParser


I am simply reading an XML document and writing it back out:

<p>Il <b>1888</b> (MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.</p>

The result is:

<p>Il<b>1888</b> (MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.</p>

As you can see, I have lost a space.

Can someone explain why this happens, or how I can prevent it?

My code:

package parsing;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class TextCase {

    public static void main(String[] args) throws Exception {
        // Round-trip a small XML fragment and print the re-serialized result.
        String text = "<p>Il <b>1888</b> (MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.</p>";
        String newString = readSave(text);
        System.out.println(newString);
    }

    public static String readSave(String text) throws Exception {


        // Encode the XML string as UTF-8 bytes so SAXBuilder can read it as a stream.
        InputStream is = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        SAXBuilder saxBuilder = new SAXBuilder();
        Document document = saxBuilder.build(is);
        Element classElement = document.getRootElement();

        //processElement(classElement, months, monthIndex);

        XMLOutputter outputter = new XMLOutputter(Format.getCompactFormat().setOmitDeclaration(true));
        String output = outputter.outputString(classElement);

        return output;
    }
}

Solution

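  • You need to use Format.getRawFormat() instead of Format.getCompactFormat().
    The compact format normalizes whitespace (TextMode.NORMALIZE): leading and
    trailing whitespace in text content is trimmed and internal runs are collapsed
    to a single space, which is why the space next to <b> disappears. The pretty
    format trims text and adds indentation, and the raw format (TextMode.PRESERVE)
    writes text exactly as it was parsed. In the code above, the change is:
    new XMLOutputter(Format.getRawFormat().setOmitDeclaration(true)).
    For comparison, here is the output of each predefined format: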

    Format.getCompactFormat()

    <p>Il<b>1888</b>(MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.</p>
    

    Format.getPrettyFormat()

    <p>
      Il
      <b>1888</b>
      (MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.
    </p>
    

    Format.getRawFormat()

    <p>Il <b>1888</b> (MDCCCLXXXVIII in numeri romani) è un anno bisestile del XIX secolo.</p>
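
To make the "why" a bit more concrete, here is a small JDK-only sketch of what whitespace normalization does to the text around <b>. It is only a rough approximation and not JDOM's actual implementation: the class TextModeSketch and its normalize and render methods are hypothetical names invented for this illustration, and the sample sentence is shortened to its ASCII part.

```java
import java.util.function.UnaryOperator;

public class TextModeSketch {

    // Assumption: a rough stand-in for whitespace normalization.
    // Trims both ends and collapses each internal whitespace run to one space.
    public static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Rebuilds <p>TEXT<b>TEXT</b>TEXT</p>, passing each piece of text
    // through the given mode before it is written out.
    public static String render(UnaryOperator<String> mode) {
        String before = "Il ";                              // text before <b>, ends with a space
        String bold = "1888";                               // text inside <b>
        String after = " (MDCCCLXXXVIII in numeri romani)"; // text after </b>, starts with a space
        return "<p>" + mode.apply(before)
                + "<b>" + mode.apply(bold) + "</b>"
                + mode.apply(after) + "</p>";
    }

    public static void main(String[] args) {
        // Normalizing each piece separately drops the spaces next to the tags.
        System.out.println(render(TextModeSketch::normalize));
        // Leaving the text untouched keeps them, like the raw format.
        System.out.println(render(UnaryOperator.identity()));
    }
}
```

The first line printed has no space on either side of the <b> element, matching the compact output shown above; the second keeps both spaces, matching the raw output. The point is that the space belongs to a text node that ends (or starts) at an element boundary, so any mode that trims text will remove it.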