java, xml-parsing, xom

StreamingPathFilter trims spaces


I use the XOM library to parse and process .docx documents. MS Word stores text content in runs (<w:r>) inside paragraph tags (<w:p>) and often splits the text across several runs; sometimes every word and every space between words is in its own run. When I load a run that contains only a space, the parser removes the space and treats the run as an empty tag. As a result, the output contains the text without spaces. How can I force the parser to keep all the spaces? I would prefer to keep this parser, but if there is no solution, could you recommend an alternative?

This is how I call the parser:

StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
builder.build(documentFile);
...

StreamingTransform contentTransform = new StreamingTransform() {

   @Override
   public Nodes transform(nu.xom.Element node) {
      // ...process XML and output text...
   }
};

Solution

  • I have since found the solution, thanks to a hint from Elliotte Rusty Harold on the XOM mailing list.

    First, StreamingPathFilter is not actually part of XOM's nu.xom package; it belongs to nux.xom, which comes from the separate Nux toolkit.

    Second, StreamingPathFilter was the cause of the problem. When I changed the code to use the default Builder constructor, the missing spaces appeared in the output.

    For the record, the new code looks like this:

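    // The default Builder (no custom NodeFactory) kept the whitespace-only runs.
    Builder builder = new Builder();
    nu.xom.Document doc = builder.build(documentFile);
    // Pick up the namespace prefixes in scope on the root element (including w:) for the XPath query.
    XPathContext context = XPathContext.makeNamespaceContext(doc.getRootElement());
    Nodes nodes = doc.getRootElement().query("w:body/*", context);
    for (int i = 0; i < nodes.size(); i++) {
        transform((nu.xom.Element) nodes.get(i));
    }
    ...

    private void transform(nu.xom.Element node) {
        // process nodes
        ...
    }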
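
On the "alternative parser" part of the question: the following is only a minimal sketch, assuming the JDK's built-in StAX API (javax.xml.stream) is used instead of XOM or Nux. It is not the code from the answer above. The class name RunTextExtractor and the method extractText are hypothetical names chosen for illustration. The sketch streams through the XML, appends the content of every <w:t> element (including runs that hold nothing but a space), and ends each <w:p> with a newline.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RunTextExtractor {

    // Main WordprocessingML namespace, normally bound to the "w" prefix in word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of all <w:t> elements, one line per <w:p>.
    static String extractText(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The input needs no DTD, so DTD processing is switched off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            boolean inText = false;
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (isW(reader, "t")) {
                            inText = true;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        // Whitespace-only text is appended like any other text.
                        if (inText) {
                            out.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (isW(reader, "t")) {
                            inText = false;
                        } else if (isW(reader, "p")) {
                            out.append('\n');
                        }
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
        return out.toString();
    }

    // True if the current start or end tag is the named element in the w: namespace.
    private static boolean isW(XMLStreamReader reader, String localName) {
        return W_NS.equals(reader.getNamespaceURI())
                && localName.equals(reader.getLocalName());
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t>Hello</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> </w:t></w:r>"
                + "<w:r><w:t>world</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.print(extractText(xml));
    }
}
```

The sample in main mimics the situation from the question: three runs, the middle one containing a single space. Because the sketch reads events one at a time, it does not build the whole document in memory, which may matter if streaming was the reason for choosing StreamingPathFilter in the first place. Tabs, line breaks and other non-<w:t> content are deliberately ignored here.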