
Stax event reader skipping white space


I'm writing a utility to alter text entities within an XML file, using the STAX event model. I've found that some of the white space in the source document isn't being copied to the output. I wrote this sample program:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.*;
import javax.xml.stream.events.*;

public class EventCopy {
    private static final String INPUT =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<foo><bar>baz</bar></foo>\n";

    public static void main(String[] args) throws XMLStreamException, IOException {
        InputStream reader = new ByteArrayInputStream(INPUT.getBytes(StandardCharsets.UTF_8));
        OutputStream writer = new ByteArrayOutputStream();

        XMLInputFactory input = XMLInputFactory.newInstance();
        XMLEventReader xmlReader = input.createXMLEventReader(reader, "UTF-8");
        try {
            XMLOutputFactory output = XMLOutputFactory.newInstance();
            XMLEventWriter xmlWriter = output.createXMLEventWriter(writer, "UTF-8");
            try {
                while (xmlReader.hasNext()) {
                    XMLEvent event = xmlReader.nextEvent();
                    System.out.print(event.getEventType() + ",");
                    xmlWriter.add(event);
                }
            } finally {
                xmlWriter.close();
            }
        } finally {
            xmlReader.close();
        }
        System.out.println("\n[" + writer.toString() + "]");
    }
}

Using the default Stax implementation that comes with Oracle Java 7, this outputs:

7,1,1,4,2,2,8,
[<?xml version="1.0" encoding="UTF-8"?><foo><bar>baz</bar></foo>]

The newlines following the XML prolog and at the end of the input have disappeared. It seems the reader doesn't even generate events for them.

I thought that maybe the XML reader was leaving the input stream positioned at the end of the last XML tag, and tried adding code to copy trailing characters from the input to the output:

    ...
    } finally {
        xmlReader.close();
    }
    int ii;
    while (-1 != (ii = reader.read())) {
        writer.write(ii);
    }

But this doesn't have any effect.

Is there a way to get STAX to copy this XML more faithfully? Would a different STAX implementation behave differently here?


Solution

  • Reference: XML spec

    A well-formed XML document follows the specification grammar:

    [1]  document ::= prolog element Misc*
    [22] prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
    [23] XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
    [27] Misc     ::= Comment | PI | S
    [3]  S        ::= (#x20 | #x9 | #xD | #xA)+
    
    [39] element  ::= EmptyElemTag
                      | STag content ETag
    [40] STag     ::= '<' Name (S Attribute)* S? '>'
    [43] content  ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
    [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
    [42] ETag     ::= '</' Name S? '>'
    

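    The line feed between XMLDecl and the root element, and the one after the root element, both match S inside Misc. They sit outside the root element, so they are not part of the document's content, and the parser is free to skip them without generating an event.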

    Here is an example of white space that is treated differently. Suppose you have a slightly different XML:

    private static final String INPUT =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<foo>\n<bar>baz</bar></foo>\n";
    

    The line feed between <foo> and <bar> is CharData, which is part of the content of foo. StAX does generate an event for this character.
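
    The difference can be checked with a small sketch. The class name CharDataEvents and the helper characterData are made up for illustration; only the javax.xml.stream calls come from StAX. It collects the text of every Characters event, so white space inside the root element shows up while the S around it does not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; the name and the helper are illustrative only.
class CharDataEvents {
    static final String INPUT =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<foo>\n<bar>baz</bar></foo>\n";

    // Returns the text of every Characters event, in document order.
    static List<String> characterData(String xml) throws XMLStreamException {
        List<String> chunks = new ArrayList<>();
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    chunks.add(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String chunk : characterData(INPUT)) {
            // Escape line feeds so white-space-only chunks are visible.
            System.out.println("[" + chunk.replace("\n", "\\n") + "]");
        }
    }
}
```

    With the implementation bundled in the JDK, I would expect the line feed after <foo> and the text baz to be reported, and nothing for the line feeds outside the root element. A parser may split character data across several events, so the number of chunks is not guaranteed.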

    If you really want to preserve S, then you'll need to read INPUT as text instead of parsing it as an XML document. Note that two XML documents that differ only in these two S characters are equivalent.
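
    As one possible reading of "read INPUT as text", here is a minimal sketch that recovers only the trailing S run from the raw text so it can be appended to the serialized output. The class TrailingSpace and its methods are hypothetical and not part of StAX, and the sketch does not restore the S that follows the XML declaration.

```java
// Hypothetical helper; the names are illustrative and not part of StAX.
class TrailingSpace {
    // True for the four characters the XML grammar allows in S.
    static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Returns the run of S characters at the very end of the raw document text.
    static String trailingSpace(String text) {
        int start = text.length();
        while (start > 0 && isXmlSpace(text.charAt(start - 1))) {
            start--;
        }
        return text.substring(start);
    }

    public static void main(String[] args) {
        String input =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                "<foo><bar>baz</bar></foo>\n";
        // Stand-in for what the event writer produced in the question.
        String serialized =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo><bar>baz</bar></foo>";
        // Append the white space that the parser did not report.
        String restored = serialized + trailingSpace(input);
        System.out.println("[" + restored + "]");
    }
}
```

    This works on the decoded text rather than on the input stream, so it does not depend on where the parser leaves the stream positioned.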