I have HTML that I'm trying to generate an XML document from. I want to skip certain elements (basically all but my divs) and for this purpose, I've written a simple DOM traversal function, but I seem to be getting stuck in an infinite loop. (More details below.)
<div id="browserDiv">
<h3>Library</h3>
<ul>
<li>
<div id="t-0" class="section topic" data-content="2b-2t-38-w-2c-2w-2t-33-36-3d">
<p>Set Theory</p>
<img class="toggle"><img class="edit">
<img class="add-entry"><img class="delete">
<ul>
<li>
<div id="t-0-0" class="section topic" data-content="1t-3c-2x-33-31-37">
<p>Axioms</p>
<img class="toggle"><img class="edit">
<img class="add-entry"><img class="delete">
<ul>
<li>
<div id="t-0-0-0" class="section topic" data-content="1t-3c-2x-33-31-w-33-2u-w-2b-2t-34-2p-36-2p-38-2x-33-32">
<p>Axiom of Separation</p>
<img class="toggle"><img class="edit">
<img class="add-entry"><img class="delete">
<ul>
<li>
<img class="add-new">
</li>
</ul>
</div>
</li>
<li>
<img class="add-new">
</li>
</ul>
</div>
</li>
<li>
<img class="add-new">
</li>
</ul>
</div>
</li>
<li>
<div id="t-1" class="section topic" data-content="1t-32-2p-30-3d-37-2x-37">
<p>Analysis</p>
<img class="toggle"><img class="edit">
<img class="add-entry"><img class="delete">
<ul>
<li>
<img class="add-new">
</li>
</ul>
</div>
</li>
<li>
<img class="add-new">
</li>
</ul>
</div>
I'm trying to convert this HTML into an XML file. The XML only stores the information contained in the div elements, so I'm trying to skip over all the other elements when I iterate through the DOM tree.
<?xml version="1.0" encoding="UTF-8"?>
<library userid="095209376">
<title>UserID095209376's Library</title>
<topic children="yes" loadable="no">
<id>0</id>
<encoding>2b-2t-38-w-2c-2w-2t-33-36-3d</encoding>
<topic children="yes" loadable="no">
<id>0-0</id>
<encoding>1t-3c-2x-33-31-37</encoding>
<topic children="no" loadable="yes">
<id>0-0-0</id>
<encoding>1t-3c-2x-33-31-w-33-2u-w-2b-2t-34-2p-36-2p-38-2x-33-32</encoding>
</topic>
</topic>
</topic>
<topic children="yes" loadable="no">
<id>1</id>
<encoding>1t-32-2p-30-3d-37-2x-37</encoding>
</topic>
</library>
(Note that the script tags are only there to get SO to do syntax highlighting.)
<script>
function saveLibrary(){
var xmlDoc = document.implementation.createDocument('http://www.tuningcode.com', 'library');
var rootNode = document.getElementById('browserDiv');
console.log("rootNode here: " + rootNode);
var libraryTree = walkLibraryTree2(rootNode, xmlDoc);
xmlDoc.documentElement.appendChild(libraryTree);
var oSerializer = new XMLSerializer();
var sXML = oSerializer.serializeToString(xmlDoc);
console.log("xmlDoc: " + xmlDoc);
console.log(sXML);
}
function walkLibraryTree2(nodeToWalk, doc){
var elem = doc.createElement(nodeToWalk.tagName);
console.log(elem);
if(nodeToWalk.hasChildNodes()){
var ch = nodeToWalk.children;
for(var i = 0; i < ch.length; i++){
var theWalk = walkLibraryTree2(ch[i], doc);
if(theWalk != null){
if(ch[i].tagName == 'DIV'){
elem.appendChild(theWalk);
} else{
elem = theWalk;
}
}
}
return elem;
} else {
return null;
}
}
saveLibrary();
</script>
The problem is that when I run it, (edit) it takes much longer than it should and produces something like this:
<library xmlns="http://www.tuningcode.com"><LI xmlns=""/></library>.
In other words, it doesn't print any of the divs, and only one li element. I have it printing to the console quite a bit, and even with only the number of nodes shown above, it prints thousands of statements to the console.
How can I traverse the tree skipping all but the div elements? Or why is the code above not working correctly?
I think you're encountering that very long running time because you call walkLibraryTree2 twice for every iteration of your for loop, resulting in an exponential expansion (your HTML is 13 levels deep, so that means walkLibraryTree2 is called over 8,000 times).
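To make that growth concrete, here is a small sketch that counts calls on a 13-level chain. It uses plain objects instead of real DOM nodes so it can run outside a browser, and makeChain, walkOnce and walkTwice are hypothetical names invented for this illustration, not functions from your code.

```javascript
// Plain objects stand in for DOM nodes; makeChain, walkOnce and walkTwice
// are hypothetical helpers written only for this illustration.

// Build a chain of `depth` nodes, each with one child (the last has none).
function makeChain(depth) {
  var node = { children: [] };
  for (var i = 1; i < depth; i += 1) {
    node = { children: [node] };
  }
  return node;
}

// Recurse into each child once; returns the total number of calls made.
function walkOnce(node) {
  var calls = 1;
  for (var i = 0; i < node.children.length; i += 1) {
    calls += walkOnce(node.children[i]);
  }
  return calls;
}

// Recurse into each child twice, e.g. once to test the result and once to use it.
function walkTwice(node) {
  var calls = 1;
  for (var i = 0; i < node.children.length; i += 1) {
    calls += walkTwice(node.children[i]); // first call
    calls += walkTwice(node.children[i]); // second call on the same child
  }
  return calls;
}

var chain = makeChain(13);
console.log(walkOnce(chain));  // 13
console.log(walkTwice(chain)); // 8191, i.e. 2^13 - 1
```

Storing the result of the recursive call in a variable and reusing it keeps the count linear in the number of nodes.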
When working with a complicated problem, it's a good idea to break it down into smaller parts. The following seems to work:
<script>
function saveLibrary() {
var xmlDoc = document.implementation.createDocument(null, 'library');
var rootNode = document.getElementById('browserDiv');
console.log("rootNode here: " + rootNode);
appendNodes(xmlDoc.documentElement, processChildren(rootNode, xmlDoc));
var oSerializer = new XMLSerializer();
var sXML = oSerializer.serializeToString(xmlDoc);
console.log("xmlDoc: " + xmlDoc);
console.log(sXML);
}
// DomNode, Document -> Array[DomNode]
function processChildren(node, doc) {
var nodes = [],
i;
for (i = 0; i < node.childNodes.length; i += 1) {
nodes = nodes.concat(processNode(node.childNodes[i], doc));
}
return nodes;
}
// DomNode, Array[DomNode] -> void
function appendNodes(destNode, nodes) {
var i;
for (i = 0; i < nodes.length; i += 1) {
destNode.appendChild(nodes[i]);
}
}
// DomNode, Document -> Array[DomNode]
function processNode(node, doc) {
var children = processChildren(node, doc);
if (node.tagName == "DIV") {
return [createTopicElement(node, doc, children)];
} else {
return children;
}
}
// DomNode, Document, Array[DomNode] -> DomNode
function createTopicElement(baseNode, doc, children) {
var el = doc.createElement("topic"),
hasChildren = !! children.length,
id = baseNode.id.substring(2), // strip the "t-" prefix from the div's id
encoding = baseNode.getAttribute("data-content");
el.setAttribute("children", hasChildren ? "yes" : "no");
el.appendChild(createElementWithValue(doc, "id", id));
el.appendChild(createElementWithValue(doc, "encoding", encoding));
appendNodes(el, children);
return el;
}
// Document, String, String -> DomNode
function createElementWithValue(doc, name, value) {
var el = doc.createElement(name);
el.textContent = value;
return el;
}
saveLibrary();
</script>
This produces the XML:
<library>
<topic children="yes">
<id>0</id>
<encoding>2b-2t-38-w-2c-2w-2t-33-36-3d</encoding>
<topic children="yes">
<id>0-0</id>
<encoding>1t-3c-2x-33-31-37</encoding>
<topic children="no">
<id>0-0-0</id>
<encoding>1t-3c-2x-33-31-w-33-2u-w-2b-2t-34-2p-36-2p-38-2x-33-32</encoding>
</topic>
</topic>
</topic>
<topic children="no">
<id>1</id>
<encoding>1t-32-2p-30-3d-37-2x-37</encoding>
</topic>
</library>
I don't know how your loadable attribute is determined, or where the title comes from, but this should get you most of the way there.
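If it helps to see the skipping idea in isolation, the sketch below applies the same strategy as processNode to plain objects, so it can be run without a browser. The field names tag, id and kids and the function collectDivs are assumptions made up for this example; they are not DOM properties.

```javascript
// Plain-object stand-ins for DOM nodes. The field names `tag`, `id` and `kids`
// and the function collectDivs are hypothetical, chosen for this sketch.

// Returns an array: a DIV wraps whatever was found beneath it, while any
// other node simply passes its descendants' results up to its parent.
function collectDivs(node) {
  var found = [];
  for (var i = 0; i < node.kids.length; i += 1) {
    found = found.concat(collectDivs(node.kids[i]));
  }
  if (node.tag === "DIV") {
    return [{ id: node.id, topics: found }];
  }
  return found;
}

var tree = {
  tag: "UL", kids: [
    { tag: "LI", kids: [
      { tag: "DIV", id: "t-0", kids: [
        { tag: "P", kids: [] },
        { tag: "UL", kids: [
          { tag: "LI", kids: [{ tag: "DIV", id: "t-0-0", kids: [] }] }
        ] }
      ] }
    ] },
    { tag: "LI", kids: [{ tag: "DIV", id: "t-1", kids: [] }] }
  ]
};

console.log(JSON.stringify(collectDivs(tree)));
// [{"id":"t-0","topics":[{"id":"t-0-0","topics":[]}]},{"id":"t-1","topics":[]}]
```

Because every call returns an array, the caller never has to special-case skipped elements: a skipped node just contributes zero or more items to its parent's list.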