Search code examples
javaxmlstructureflat

Java XML: Convert Depth Aligned Structure to Breadth-Aligned Structure


this question refers to an algorithm we have not been able to find it. The problem is the following:

We have an XML (to be specific an ODT File containing a content.xml) containing a depth-aligned structure:

<root xmlns:text="someuri">
<header>
...
</header>
<body>
    <text:span dns:att01="value">
        some text
    </text:span>
    <text:span dns:att02="value">
        more text
        <text:span dns:att03="value">
            even nested structures
        </text:span>
    </text:span>
</body>
</root>

Please note, that this is a simplified example containing only the necessary details. As you can see, this looks like a 'normal' xml structure with a root containing some text and span nodes. For our application we need to do some processing. As all the span nodes contain other nodes, forming a tree like structure, the target format needs to be converted to have the text nodes breadth-aligned. This is the desired format:

<root xmlns:text="someuri">
<header>
...
</header>
<body>
    <text:marker-begin text:name="01" />
        some text
    <text:marker-end text:name="01" />
    <text:marker text:name="01" />

    <text:marker-begin text:name="02" />
        more text
        <text:marker-begin text:name="03" />
            even nested structures
        <text:marker-end text:name="03" />
        <text:marker text:name="03" />
    <text:marker-end text:name="02" />
    <text:marker text:name="02" />

</body>
</root>

Don't let the indentation irritate you, that all text nodes might have a direct parent except the body node. The marker is used to trigger a certain feature from a third party software. The desired text notes are now surounded by empty elements implicating a marking mechanism. Now, after some verbose preparation, the question itself:

How would you transform structure one into structure two using the default DOM mechanisms available via java. Is this even possible? Would you rather suggest a SAX approach to collect the start and ending elements of a span node? Does an algorithm already exists for this problem? XLST is not possible due to a side-processing chain, that must be done during the process.


Solution

  • We found a solution by using a dirty trick:

    we have a breadth-first implementation of a traverser (using the TreeWalker doesn't make any sense here) delegate the desired operation to a processing function:

    // local field
    Queue queue;
    
    void traverse()
    {
        queue = new LinkedListed();
        queue.add(documentRoot);
    
        queue.add(root);
        while (!queue.isEmpty()) 
        {
            current     = queue.poll();
            children    = current.getChildNodes();
    
            // the delegate
            process(current);
    
            for (int i = 0; i < children.getLength(); i++) 
            {
                child = children.item(i);
                switch(child.getNodeType())
                {
                case Node.ELEMENT_NODE:
                case Node.TEXT_NODE:
                    queue.add(child);
                    break;
                }
            } // end iteration over childnodes
        }
    }
    

    and here is the processing function:

    void process(Node node)
    {
                String name                     = node.getNodeName();
            Map<String, String> attributes      = XMLUtil.extractAttributes(node);
    
            // this is basically the current node, but we need to save it as
            // extra reference to copy all child elements from it, to further process
            // the document tree
            Node target = null;
            Node next   = null;
            Node parent = node.getParentNode();
    
            if(name.equals("text:" + TARGET_ELEMENT)) {
                // deep copy
                target = node.cloneNode(true);
    
                // create the three relevant bookmark nodes
                Node bkMrkStart = document.createElement("bookmark-begin");
                Node bkMrkEnd   = document.createElement("bookmark-end");
                Node bkMrkRsd   = document.createElement("bookmark");
    
                // insert bookmark start
                node.getParentNode().insertBefore(bkMrkStart, node);
    
                // get next sibling or null, if last elment
                next = node.getNextSibling();
    
                // insert ending bookmark and 'residue'
                parent.insertBefore(bkMrkRsd, next);
                parent.insertBefore(bkMrkEnd, bkMrkRsd);
    
                // create new its tag element
                AuxiliaryElement nextAux = createAuxiliary(attributes);
    
    
                // apply generated id to created bookmarks
                XMLUtil.setAttribute(
                        "text:span", 
                        "id-[" + nextAux.getId().getString() + "]", 
                        bkMrkStart, bkMrkEnd, bkMrkRsd);
    
    
                NodeList children = target.getChildNodes();
    
                int index = 0;
                do
                {
                    Node child = children.item(index).cloneNode(true);
                    // it seems necessary to extra save this element
                                // for further processing
                    queue.add(child);
    
                    parent.insertBefore(child, bkMrkEnd);
                } while(++index < children.getLength());
    
                // delete old element
                parent.removeChild(node);
    
                // clear target
                target = null;
            }
    }
    

    It looks like #removeChild or #insertBefore is not being reflected by the traversal. It might be, that this is due to our own implementation of a breadth first traverser. However using this approach as described above gives the desired result.