Searching element faster using document.querySelector in a large DOM


In a huge DOM with hundreds of elements, finding an element using document.querySelector("input[name='foo'][value='bar']") takes about 3-5 seconds per element. Is there a way I can reduce this time, perhaps by giving the full path of the element, e.g. document.querySelector("parent child grandchild ... input[name='foo'][value='Modem']"), or in some other way?

I'm using CasperJS to test a large webpage. Fetching each element takes so long that my test runs for an hour. I've also tried __utils__.findOne(), but the result is the same: 3-4 seconds per element. Since my test focuses on a very small part of the page, I wish there were some way to tell document.querySelector to restrict the element search to a particular portion of the page.

So could someone tell me what the fastest way, if any, is to fetch elements from a large DOM?

Update: This is how I measured the time

var init = (new Date()).getTime();
var element = this.evaluate(function() {
    // outer double quotes so the inner single quotes do not end the string
    return document.querySelector("input[value='somethin'][name='somethin']");
});
this.echo('Time Taken :' + ((new Date()).getTime() - init));

Somehow the time is very high when I fetch radio buttons from the form; select elements and text boxes, however, are returned within a few milliseconds (I noticed this only today).

When I run document.querySelector("input[value='somethin'][name='somethin']") in the console of a modern browser such as Chrome, it takes less than a second.

I don't know whether this has to do with PhantomJS's headless browser or something else. Fetching elements is slow only on one particular page of that website.

And yes, the page is very large, with hundreds of thousands of elements. It's a legacy webapp that is a decade old. On that page in IE 8, pressing F12 to view the source hangs IE for 5 minutes, but Chrome and Firefox are fine. Maybe PhantomJS is running out of memory; it occasionally crashes when I run the test on that particular page. I'm not sure what is relevant, so I don't know whether this info helps.


Solution

  • General considerations

    The fastest selector would be the id selector, but even if you had ids higher up the tree, they would not gain you much. As Ian pointed out in the comments, selectors are evaluated right to left. This means the engine first finds all inputs that have the matching attributes, even if there is only one, and only then walks up the tree to check whether the ancestor parts of the selector match.

    I found that if you know which enclosing element contains the inputs, you can use JavaScript DOM properties to walk to that element and run querySelector over a smaller part of the tree. At least in my tests, this reduces the time by more than half.
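
    The scoping idea above can be sketched as a small helper. This is only an illustration: findScoped and its parameter names are hypothetical, not part of CasperJS or PhantomJS, and it assumes you know a cheap selector (ideally an id) for the enclosing element.

```javascript
// Hypothetical helper (not a CasperJS/PhantomJS API): look up the enclosing
// element first, then search only inside its subtree.
// `root` is any object that offers querySelector, normally `document`.
function findScoped(root, containerSelector, targetSelector) {
    var container = root.querySelector(containerSelector);
    if (!container) {
        return null; // enclosing element not found, nothing to search
    }
    // This query runs against the container's subtree, not the whole document.
    return container.querySelector(targetSelector);
}
```

    In CasperJS the helper would have to be defined inside the function passed to this.evaluate(), because that function runs in the page context. With the ids from the test script further down, a call could look like findScoped(document, '#f42', 'input[name="in42"]').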

    Memory problem

    Judging by your updated question, it seems that it is really a memory problem. When you have hundreds of thousands of elements the relatively old PhantomJS WebKit engine will try to allocate enough memory. When it takes more memory than is free or even more than your machine has, the OS compensates by using swap memory on the hard disk.

    When your script tries to query an element that is currently only in swap, this query takes very long, because it has to fetch the data from the high latency hard disk which is very slow compared to memory.

    My tests ran against 100k forms with one element each and took under 30 msec per query. When I increased the number of elements, the execution time grew linearly until at some point I got the following error (captured by registering an onError handler):

    runtime error R6016
    - not enough space for thread data
    

    So I cannot reproduce your problem of 3-5 seconds per query on Windows.

    Possible solutions

    1. Better hardware:

    Try to run it on a machine with more memory and see if it runs better.

    2. Reduce used memory by closing unnecessary applications

    3. Manipulate the page to reduce the memory footprint:

        1. If there are parts of the page that you don't need to test, you can simply remove them from the DOM before running the tests. If you need to test all of it, you could run multiple tests on the same page, each time removing everything that is not currently being tested.

        2. If this is an image-heavy site, don't load images: set casper.options.pageSettings.loadImages = false;.
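
    Removing the untested parts of the page (the first sub-point above) could look roughly like the following sketch. keepOnly is a hypothetical name, and I have not measured how much memory PhantomJS actually releases afterwards, so treat it as something to try rather than a guaranteed fix.

```javascript
// Hypothetical helper: remove everything except `target`, its descendants
// and its ancestors, climbing up until `stopAt` (e.g. document.body).
// Assumes `target` is a descendant of `stopAt`; otherwise the climb continues
// to the top of the tree and prunes siblings all the way up.
function keepOnly(target, stopAt) {
    var node = target;
    while (node && node !== stopAt && node.parentNode) {
        var parent = node.parentNode;
        var sibling = parent.firstChild;
        while (sibling) {
            // remember the next sibling before the current one is detached
            var next = sibling.nextSibling;
            if (sibling !== node) {
                parent.removeChild(sibling);
            }
            sibling = next;
        }
        node = parent; // continue one level higher
    }
}
```

    Inside this.evaluate() this might be called as keepOnly(document.querySelector('#someForm'), document.body), where '#someForm' is a placeholder for the part of the page under test. Stopping at document.body leaves the head and its scripts untouched.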

    Test script

    var page = require('webpage').create();
    var content = "",
        max = 100000,
        i;
    
    for(i = 0; i < max; i++) {
        content += '<form id="f' + i + '"><input type="hidden" name="in' + i + '" value="iv' + i + '"></form>';
    }
    
    page.evaluate(function(content){
        document.body.innerHTML = content;
    }, content);
    
    console.log("FORMS ADDED");
    
    setTimeout(function(){
        var times = page.evaluate(function(max){
            var obj = {
                cssplain: 0,
                cssbyForm: 0,
                cssbyFormChild: 0,
                cssbyFormJsDomChild: 0,
                cssbyFormChildHybridChild: 0,
                cssbyFormHybridChild: 0,
                xpathplain: 0,
                xpathbyForm: 0
            },
                idx, start, el, i,
                repeat = 100;
    
            function runTest(name, obj, test) {
                var idx = Math.floor(Math.random()*max);
                var start = (new Date()).getTime();
                var el = test(idx);
                obj[name] += (new Date()).getTime() - start;
                return el;
            }
    
            for(i = 0; i < repeat; i++){
                runTest('cssplain', obj, function(idx){
                    return document.querySelector('input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('cssbyForm', obj, function(idx){
                    return document.querySelector('#f'+idx+' input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('cssbyFormChild', obj, function(idx){
                    return document.querySelector('form:nth-child('+(idx+1)+') input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('cssbyFormJsDomChild', obj, function(idx){
                    return document.body.children[idx].querySelector('input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('cssbyFormChildHybridChild', obj, function(idx){
                    return document.querySelector('form:nth-child('+(idx+1)+')').querySelector('input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('cssbyFormHybridChild', obj, function(idx){
                    return document.querySelector('#f'+idx).querySelector('input[name="in'+idx+'"][value="iv'+idx+'"]');
                });
    
                runTest('xpathplain', obj, function(idx){
                    return document.evaluate('//input[@name="in'+idx+'" and @value="iv'+idx+'"]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
                });
    
                runTest('xpathbyForm', obj, function(idx){
                    return document.evaluate('//form[@id="f'+idx+'"]//input[@name="in'+idx+'" and @value="iv'+idx+'"]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
                });
            }
            for(var type in obj) {
                obj[type] /= repeat;
            }
            return obj;
        }, max);
        console.log("TIMES");
        for(var type in times) {
            console.log(type+":\t"+times[type]);
        }
        phantom.exit();
    }, 0); // just in case the content is not yet evaluated
    

    Output on my machine (average milliseconds per query over 100 runs, formatted for readability):

    cssbyForm:                  29.55
    cssbyFormChild:             29.97
    cssbyFormChildHybridChild:  11.51
    cssbyFormHybridChild:       10.17
    cssbyFormJsDomChild:        11.73
    cssplain:                   29.39
    xpathbyForm:                206.66
    xpathplain:                 207.05
    

    Note: I used PhantomJS directly. The results should not differ when the same technique is used in CasperJS.