Search code examples
javascriptxmldom

DOM navigation: eliminating the text nodes


I have a js script that reads and parses XML. It obtains the XML from an XMLHttpRequest request (which contacts with a php script which returns XML). The script is supposed to receive 2 or more nodes under the first parentNode. The 2 nodes it requires have the name well defined, the other ones can be any name. The output from the php may be:

<?xml version='1.0'?>
<things>
    <carpet>
        <id>1</id>
        <name>1</name>
        <desc>1.5</desc>
    </carpet>
    <carpet>
        <id>2</id>
        <name>2</name>
        <height>unknown</height>
    </carpet>
</things>

Here all carpets have 7 nodes.

but it also may be:

<?xml version='1.0'?>
<things>
    <carpet>
        <id>1</id>
        <name>1</name>
        <desc>1.5</desc>
    </carpet>
    <carpet><id>2</id><name>2</name><height>unknown</height></carpet>
</things>

Here the first carpet has 7 nodes, the 2nd carpet has 3 nodes. I want my javascript code to treat both exactly the same way in a quick and clean way. If possible, I'd like to remove all the text nodes between each tag. So a code like the one above would always be treated as:

<?xml version='1.0'?>
    <things><carpet><id>1</id><name>1</name><desc>1.5</desc></carpet><carpet><id>2</id><name>2</name><height>unknown</height></carpet></things>

Is that possible in a quick and efficient way? I'd like not to use any get function (getElementsByTagName(), getElementById, ...), if possible and if more efficient.


Solution

  • It's pretty straightforward to walk the DOM and remove the nodes you consider empty (containing only whitespace).

    This is untested (tested and fixed, live copy here), but it would look something like this (replace those magic numbers with symbols, obviously):

    var reBlank = /^\s*$/;
    function walk(node) {
        var child, next;
        switch (node.nodeType) {
            case 3: // Text node
                if (reBlank.test(node.nodeValue)) {
                    node.parentNode.removeChild(node);
                }
                break;
            case 1: // Element node
            case 9: // Document node
                child = node.firstChild;
                while (child) {
                    next = child.nextSibling;
                    walk(child);
                    child = next;
                }
                break;
        }
    }
    walk(xmlDoc); // Where xmlDoc is your XML document instance
    

    There my definition of "blank" is anything which only has whitespace according to the JavaScript interpreter's understanding of the \s (whitespace) RegExp class. Note that some implementations have issues with \s not being inclusive enough (several Unicode "blank" characters outside the ASCII range not being matched, etc.), so be sure to test with your sample data.