
VTD fails to evaluate a "find all empty nodes with no attributes" xpath


I think I found a bug in version 2.13.4 of vtd-xml. In short, I have the following code snippet:

String test = "<catalog><description></description></catalog>";
VTDGen vg = new VTDGen();
vg.setDoc(test.getBytes("UTF-8"));
vg.parse(true);
VTDNav vn = vg.getNav();
// select elements with no child nodes, no text, and no attributes
String xpath = "/catalog//*[not(child::node()) and not(child::text()) and count(@*)=0]";
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath(xpath);
// the body of this loop is never executed
while(ap.evalXPath()!=-1) {
   System.out.println("current node "+vn.toRawString(vn.getCurrentIndex()));
}

This doesn't work: it finds no node, although it should find "description". The same code works if I use a self-closed tag:

String test = "<catalog><description/></catalog>";

The point is that every other XPath evaluator works with both versions of the XML. Sadly, I receive the XML from an external source, so I have no control over it. Breaking the XPath down, I noticed that evaluating both

/catalog//*[not(child::node())]

and

/catalog//*[not(child::text())]

give false as the result. As an additional check, I tried something like:

String xpath = "/catalog/description/text()";
ap.selectXPath(xpath);
if(ap.evalXPath()!=-1)
   System.out.println(vn.toRawString(vn.getCurrentIndex()));

And this prints an empty string, so somehow VTD "thinks" the node has text (empty, but still text), whereas I expect no match. Any hints?
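
For comparison, here is a minimal sketch of the same queries run through the JDK's built-in XPath 1.0 engine (javax.xml.xpath) instead of vtd-xml. The class name JdkXPathComparison and the helper countMatches are made up for this illustration; only the XPath expressions and the two XML strings come from the snippets above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class JdkXPathComparison {

    // Same expression as in the question.
    static final String EMPTY_NODES =
        "/catalog//*[not(child::node()) and not(child::text()) and count(@*)=0]";

    // Hypothetical helper: counts the nodes the JDK XPath engine selects in the given XML.
    static int countMatches(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String startEndTags = "<catalog><description></description></catalog>";
        String selfClosed = "<catalog><description/></catalog>";

        // Number of empty, attribute-less elements found in each variant.
        System.out.println(countMatches(startEndTags, EMPTY_NODES));
        System.out.println(countMatches(selfClosed, EMPTY_NODES));

        // Number of text nodes under <description> in the start/end tag variant.
        System.out.println(countMatches(startEndTags, "/catalog/description/text()"));
    }
}
```

With the JDK engine, both documents should yield one match for the first expression, and /catalog/description/text() should select nothing, which is the behaviour I expected from vtd-xml as well.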


Solution

  • TL;DR

    When I faced this issue, I was left with three main options (see below). I went for the second one: use XMLModifier to fix the VTDNav. At the bottom of my answer, you'll find an implementation of this option and a sample output.


    The long story ...

    I faced the same issue. Here are the three main options I first thought of (in order of difficulty):

    1. Turn empty elements into self closed tags in the XML source.

    This option isn't always possible (as in the OP's case). Moreover, it may be difficult to pre-process the XML beforehand.

    2. Use XMLModifier to fix the VTDNav.

    Find the empty elements with an xpath expression, replace them with self closed tags and rebuild the VTDNav.

    2.bis Use XMLModifier#removeToken

    A lower-level variant of the preceding solution would be to loop over the tokens in the VTDNav and remove the unnecessary ones with XMLModifier#removeToken.

    3. Patch the vtd-xml code directly.

    Taking this path may require more effort and more time. IMO, the optimized vtd-xml code isn't easy to grasp at first sight.


    Option 1 wasn't feasible in my case. I failed to implement Option 2bis: the "unnecessary" tokens still remained. I didn't look at Option 3 because I didn't want to fix some (rather complex) third-party code.
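
Where Option 1 is possible, the pre-processing could look roughly like the sketch below. It is not part of my actual solution: the class EmptyElementPreprocessor and the method collapseEmptyElements are hypothetical names, and the regex assumes that the whole document fits in a String, that no attribute value contains a literal '>', and that no comment or CDATA section contains text that looks like an empty element.

```java
import java.util.regex.Pattern;

final class EmptyElementPreprocessor {

    // Matches <name attrs></name> where the closing name equals the opening name.
    // Group 1 is the element name, group 2 the (possibly empty) attribute list.
    private static final Pattern EMPTY_ELEMENT =
        Pattern.compile("<([^\\s/<>]+)((?:\\s[^<>]*)?)></\\1\\s*>");

    // Rewrites every empty start/end tag pair as a self-closed tag.
    static String collapseEmptyElements(String xml) {
        return EMPTY_ELEMENT.matcher(xml).replaceAll("<$1$2/>");
    }

    public static void main(String[] args) {
        String xml = "<root><empty-element></empty-element><self-closed/>"
                   + "<empty-element2 foo='bar'></empty-element2></root>";
        System.out.println(collapseEmptyElements(xml));
    }
}
```

The result can then be handed to VTDGen#setDoc as usual. Going through a String costs an extra copy of the document, which may matter if vtd-xml was chosen for its low memory footprint.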

    I was left with Option 2. Here is an implementation:

    Code

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    import com.ximpleware.AutoPilot;
    import com.ximpleware.NavException;
    import com.ximpleware.VTDException;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;
    import com.ximpleware.XMLModifier;
    import org.junit.Test; // assumption: JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test
    
    @Test
    public void turnEmptyElementsIntoSelfClosedTags() throws VTDException, IOException {
        // STEP 1 : Load XML into VTDNav
        // * Convert the initial xml code into a byte array
        String xml = "<root><empty-element></empty-element><self-closed/><empty-element2 foo='bar'></empty-element2></root>";
        byte[] ba = xml.getBytes(StandardCharsets.UTF_8);
    
        // * Build VTDNav and dump it to screen
        VTDGen vg = new VTDGen();
        vg.setDoc(ba);
        vg.parse(false); // Use `true' to activate namespace support
    
        VTDNav nav = vg.getNav();
        dump("BEFORE", nav);
    
    
        // STEP 2 : Prepare to fix the VTDNav
        // * Prepare an autopilot to find empty elements
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("//*[count(child::node())=1][text()='']");
    
        // * Prepare a simple regex matcher to create self closed tags
        Matcher elementReducer = Pattern.compile("^<(.+)></.+>$").matcher("");
    
    
        // STEP 3 : Fix the VTDNav
        // * Instantiate an XMLModifier on the VTDNav
        XMLModifier xm = new XMLModifier(nav);
        ByteArrayOutputStream baos = new ByteArrayOutputStream(); // baos will hold the elements to fix
        String utf8 = StandardCharsets.UTF_8.name();
    
        // * Find all empty elements and replace them
        while (ap.evalXPath() != -1) {
            nav.dumpFragment(baos);
            String emptyElementXml = baos.toString(utf8);
            String selfClosingTagXml = elementReducer.reset(emptyElementXml).replaceFirst("<$1/>");
    
            xm.remove();
            xm.insertAfterElement(selfClosingTagXml);
    
            baos.reset();
        }
    
        // * Rebuild VTDNav and dump it to screen
        nav = xm.outputAndReparse(); // You MUST call this method to save all your changes
        dump("AFTER", nav);
    }
    
    private void dump(String msg,VTDNav nav) throws NavException, IOException {
        System.out.print(msg + ":\n  ");
        nav.dumpFragment(System.out);
        System.out.print("\n\n");
    }
    

    Output

    BEFORE:
      <root><empty-element></empty-element><self-closed/><empty-element2 foo='bar'></empty-element2></root>
    
    AFTER:
      <root><empty-element/><self-closed/><empty-element2 foo='bar'/></root>